Group08 members:
Carolina Pina - R20170790
Mariana Camarneiro - R20170744
Matilde Pires - R20170783
Rui Monteiro - R20170796
Vasco Pestana - R20170803
MSc: Data Science and Advanced Analytics - Nova IMS
Course: Machine Learning
2020/2021
In 2048, mission “Newland” had one major goal: to send spaceships with thousands of humans to a habitable planet, discovered some years earlier, as life on Earth was becoming unsustainable. The citizens were segregated into three groups: Group A had volunteers, Group B had important people who were paid by the State to participate, and Group C had people who paid to go.
On the new planet, some citizens had incomes that were higher than average, so they started to pay taxes to the Government, which intends to create a predictive model that classifies new residents as someone who has “income higher than average” or “income lower or equal to the average”.
This study aims to develop that model, which should achieve the best possible performance on a test dataset by performing binary classification on 10 100 unseen records. To that end, several predictive models will be tested, in order to assess their performance and pick the best one. Furthermore, those models will be evaluated on 4 slightly different datasets, corresponding to different approaches to dealing with the data, as will be explained.
The training dataset is composed of 22 400 observations. The target variable is Income and equals 1 if a given citizen has an income higher than the average, and 0 if it is lower than or equal to the average. The datasets are mainly composed of socio-economic features that range from general basic information about the participants (Name, Birthday, Native Continent, Marital Status, Education Level and Years of Education) to specific information about their participation in the experiment: Citizen ID, Lives With, Base Area, Employment Sector, Role, Working Hours per Week, Money Received and Ticket Price.
To run this Notebook without issues, the user should either fork and clone our GitHub repository, available at https://github.com/VascoPestana/ml_2020, or place this Notebook in a folder that contains a "Data Folder" with the Train and Test datasets inside.
Furthermore, the user needs an Anaconda environment with all the used libraries. A yml file is provided on our GitHub repository (link above) for that purpose. The user can create the environment with the group08.yml file, by following these steps:
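Assuming a standard conda workflow (and assuming the environment defined in group08.yml is named group08 — the exact steps live in the repository, so treat the commands below as a sketch), the creation would look like:

```shell
# Create the environment from the provided file (run from the folder containing group08.yml)
conda env create -f group08.yml
# Activate it (the environment name "group08" is an assumption based on the file name)
conda activate group08
# Optionally register the environment as a Jupyter kernel
python -m ipykernel install --user --name group08
```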
As an alternative to creating the environment, if the user already has the most common libraries (pandas, numpy, seaborn, sklearn, etc.) in their own environment, the next code cell can be uncommented in order to pip install the two extra libraries: mlxtend and imbalanced-learn.
# pip install mlxtend
# pip install -U imbalanced-learn
The user does not need to run the entirety of this Notebook at once. The best approach was our baseline model, which can be reached through the "Baseline" and "First Baseline Models" hyperlinks at the top of the Notebook. Those are the most important models to run. The other approaches were experiments to see whether we could improve the performance of our predictive model, which did not turn out to be the case. These approaches are described in more detail in the report of our Project.
It is also important to note that the "Data Exploration and pre-processing" phase is common to all approaches.
# Note: The predictive algorithm's functions are imported on the Predictive Modelling section
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from math import ceil
from datetime import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.feature_selection import mutual_info_classif
from numpy.random import seed
from numpy.random import randn
from scipy.stats import shapiro
from scipy.stats import chi2_contingency
from scipy.stats import chi2
from sklearn import preprocessing
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeClassifierCV
from sklearn.preprocessing import RobustScaler
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics.cluster import normalized_mutual_info_score
from scipy.stats import pointbiserialr
from sklearn.model_selection import train_test_split
# For better resolution plots
%config InlineBackend.figure_format = 'retina'
# Setting seaborn style
sns.set()
# To filter warnings
import warnings
warnings.filterwarnings('ignore')
# Get the dataset and check its first rows
df_train = pd.read_excel(r'Data Folder/Train.xlsx')
df_train.head()
df_test = pd.read_excel(r'Data Folder/Test.xlsx')
# Checking data types and nulls in the dataset --> there are no NaNs
df_train.info()
# Birthday is of type object, so we're transforming it to type datetime
# For that we must first fix the problem of February 29 appearing in non-leap years --> turn every February 29 into
# February 28
# Note: This change makes it possible to convert the column to datetime without meaningfully affecting the citizens' ages.
df_train['Birthday'] = df_train['Birthday'].map(lambda x: x.replace("February 29", "February 28"))
# Check if the replacement worked
df_train['Birthday'][df_train['Birthday'].str.contains("February 29")]
# Change the format the date appears and the data type to datetime
df_train['Birthday'] = df_train['Birthday'].map(lambda x: datetime.strptime(x, " %B %d,%Y").date())
df_train['Birthday'] = pd.to_datetime(df_train['Birthday'])
# Same for the test dataset
df_test['Birthday'] = df_test['Birthday'].map(lambda x: x.replace("February 29", "February 28"))
df_test['Birthday'] = df_test['Birthday'].map(lambda x: datetime.strptime(x, " %B %d,%Y").date())
df_test['Birthday'] = pd.to_datetime(df_test['Birthday'])
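As a side note, the format string " %B %d,%Y" mirrors how the raw Birthday strings are written: a leading space, the full month name, the day, then a comma immediately followed by the year. A minimal illustration with a hypothetical date:

```python
from datetime import datetime

# The raw Birthday strings carry a leading space and no space after the comma,
# hence the format " %B %d,%Y"
d = datetime.strptime(" March 15,2020", " %B %d,%Y").date()
print(d)  # 2020-03-15
```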
# Get a descriptive overview of the variables (both metric and non-metric)
df_train.describe(include="all")
# Define Citizen_ID as the index
df_train.set_index("CITIZEN_ID", inplace=True)
# Define Income variable as the target and remove it from the dataframe with the independent variables
target = df_train['Income']
df_train = df_train.drop(['Income'], axis=1)
# Define metric and non-metric datasets
metric = df_train.loc[:, np.array(df_train.dtypes=="int64")]
non_metric = df_train.loc[:,np.array(df_train.dtypes=="object")]
# Same division for test dataset
metric_test = df_test.loc[:, np.array(df_test.dtypes=="int64")]
non_metric_test = df_test.loc[:,np.array(df_test.dtypes=="object")]
# Get only the year from Birthday, so as to get a better visualization of the values
metric['Birthday'] = df_train.Birthday.map(lambda x: x.year)
# Same for test set
metric_test['Birthday'] = df_test.Birthday.map(lambda x: x.year)
# Remove Name from the list of non_metric variables, since it carries no meaningful or valuable information and
# plotting it would be useless
non_metric.drop(columns="Name", inplace=True)
# Checking metrics variables' distribution and pairwise relationship
sns.set(style="whitegrid")
# Setting pairgrid
g = sns.PairGrid(metric)
# Pairgrid
mdg = g.map_diag(plt.hist, edgecolor="w", color="peru")
mog = g.map_offdiag(plt.scatter, edgecolor="w", color="peru", s=40)
# Layout
plt.subplots_adjust(top=0.92)
plt.suptitle("Pairwise relationship of metric variables", fontsize=25)
plt.show()
# Barplots for the non-metric variables
sns.set_style("whitegrid")
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(30,40))
ax1=sns.countplot(non_metric["Native Continent"], ax=axes[0,0])
ax2=sns.countplot(non_metric["Lives with"], ax=axes[0,1])
ax3=sns.countplot(non_metric["Marital Status"], ax=axes[1,0])
ax4=sns.countplot(non_metric["Base Area"], ax=axes[1,1])
ax5=sns.countplot(non_metric["Employment Sector"], ax=axes[2,0])
ax6=sns.countplot(non_metric["Education Level"], ax=axes[2,1])
ax7=sns.countplot(non_metric["Role"], ax=axes[3,0])
# ax8=sns.countplot(non_metric["Birthday"], ax=axes[3,1])
ax1.tick_params(labelsize=17)
ax1.set_xlabel(xlabel='Native Continent',fontsize = 19)
ax2.tick_params(labelsize=17)
ax2.set_xlabel(xlabel='Lives With',fontsize = 19)
ax3.tick_params(labelsize=17)
ax3.set_xticklabels(ax3.get_xticklabels(), rotation=90)
ax3.set_xlabel(xlabel='Marital Status',fontsize = 19)
ax4.tick_params(labelsize=17)
ax4.set_xticklabels(ax4.get_xticklabels(), rotation=90)
ax4.set_xlabel(xlabel='Base Area',fontsize = 19)
ax5.tick_params(labelsize=17)
ax5.set_xticklabels(ax5.get_xticklabels(), rotation=90)
ax5.set_xlabel(xlabel='Employment Sector',fontsize = 19)
ax6.tick_params(labelsize=17)
ax6.set_xticklabels(ax6.get_xticklabels(), rotation=90)
ax6.set_xlabel(xlabel='Education Level',fontsize = 19)
ax7.tick_params(labelsize=17)
ax7.set_xticklabels(ax7.get_xticklabels(), rotation=90)
ax7.set_xlabel(xlabel='Role',fontsize = 19)
# ax8.tick_params(labelsize=17)
# ax8.set_xticklabels(ax.get_xticklabels(), rotation=90)
# ax8.set_xlabel(xlabel='Birthday',fontsize = 19)
plt.subplots_adjust(top=0.95,hspace=0.75)
plt.suptitle("Distribution of non-metric variables", fontsize=40)
# Looking at the plots above, we can see the distribution of values per category for each non-metric variable.
# With this, we see that there are 3 variables containing "?" as a value, which we understand to be null values.
# Hence, here we replace those "?" with null values, to analyze them more efficiently
df_train = df_train.replace('?', np.nan)
# Now, we can already see how many missing values each variable actually has
df_train.isna().sum()
# Correlation matrix for the metric variables
sns.set(style="white")
# Compute the correlation matrix
corr = metric.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(12, 8))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, ax=ax, annot=True)
# Layout
plt.subplots_adjust(top=0.95)
plt.suptitle("Correlation matrix", fontsize=20)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
# Fixing the bug of partially cut-off bottom and top cells
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
# Analysis of the relation between Base Area and Role
pd.set_option('display.max_rows', 500)
df_train.groupby(["Base Area","Role"])["Role"].count()
# Analysis of the relation between Marital Status and Lives with
pd.crosstab(df_train['Marital Status'], df_train['Lives with'], margins=True)
# Check if there is anyone born after the year of this experiment (2048)
len(df_train[(metric.Birthday>2048)])
# Check the oldest year of birth and most recent one
print(metric.Birthday.min(), metric.Birthday.max())
# Check if there are negative amounts of money
df_train[(df_train["Money Received"]<0) | (df_train["Ticket Price"]<0)]
# Check if there is anyone who paid for the ticket and, at the same time, received money to join the experiment
df_train[(df_train["Ticket Price"]!=0) & (df_train["Money Received"]!=0)]
# Check if there is anyone with a certain level of education and years of education that don't match at all
df_train.groupby(["Years of Education","Education Level"])["Years of Education"].mean()
# We did not consider Preschool relevant (in academic terms), so we gave it a more intuitive label
df_train['Education Level'] = df_train['Education Level'].replace('Preschool', 'No Relevant Education')
df_train["Education Level"].unique()
# We also replaced the "Preschool" years of education to 0, instead of 2
df_train['Years of Education'] = df_train['Years of Education'].replace(2, 0)
df_train["Years of Education"].unique()
# Check if there is anyone whose marital status seems incoherent with the person he/she lives with
df_train.groupby(["Marital Status","Lives with"] )["Lives with"].count()
# Check if there are young people with a very large/unusual amount of years of education
df_train_copy = df_train.copy()
df_train_copy["age"] = 2048 - df_train_copy.Birthday.map(lambda x: x.year)
df_train_copy[df_train_copy["age"] < (df_train_copy["Years of Education"]+5)]
# Check if there is anyone with more years of education than his/her age
df_train[df_train.Birthday.map(lambda x: 2048-x.year) < (df_train["Years of Education"]+5)]
# Check if unemployed citizens have Role and Working Hours different from 0
df_train[["Employment Sector", "Role", "Working Hours per week"]][df_train["Employment Sector"]=="Unemployed"]
# Check if "Never Worked" citizens have Role and Working Hours different from 0
df_train[["Employment Sector", "Role", "Working Hours per week"]][df_train["Employment Sector"]=="Never Worked"]
# Using .loc avoids pandas' chained-assignment pitfalls when overwriting these values
df_train.loc[(df_train["Employment Sector"]=="Unemployed") | (df_train["Employment Sector"]=="Never Worked"), "Working Hours per week"] = 0
df_train.loc[df_train["Employment Sector"]=="Never Worked", "Role"] = "No Role"
# Check the changes
df_train[["Employment Sector", "Role", "Working Hours per week"]][(df_train["Employment Sector"]=="Unemployed") | (df_train["Employment Sector"]=="Never Worked")]
# Box plots for the metric variables
sns.set(style="whitegrid")
data = pd.melt(metric)
plot_features = metric.columns
#Prepare figure layout
fig, axes = plt.subplots(1, len(plot_features), figsize=(15,8), constrained_layout=True)
# Draw the boxplots
for i in zip(axes, plot_features):
    sns.boxplot(x="variable", y="value", data=data.loc[data["variable"]==i[1]], ax=i[0], color='peru')
    i[0].set_xlabel("")
    i[0].set_ylabel("")
# Finalize the plot
plt.suptitle("Metric variables' box plots", fontsize=25)
sns.despine(bottom=True)
plt.show()
# Check how many citizens have received more than 120000
df_train[df_train["Money Received"]>120000]
df_train["Working Hours per week"].describe()
# Function to do outlier detection with the IQR method
def out_iqr(data, k=1.5, return_thresholds=False):
    # k - cutoff to multiply the IQR by
    # return_thresholds - True returns the lower and upper bounds; False returns the boolean outlier mask
    # Calculate the interquartile range
    q25, q75 = np.percentile(data, 25, axis=0), np.percentile(data, 75, axis=0)
    iqr = q75 - q25
    # Calculate the outlier cutoff
    cut_off = iqr * k
    lower, upper = q25 - cut_off, q75 + cut_off
    if return_thresholds:
        return lower, upper
    else:
        # Return a boolean mask flagging rows with at least one value outside the bounds
        return data.apply(lambda x: np.any((x < lower) | (x > upper)), axis=1)
# Testing with k=3.5
outliers = out_iqr(df_train[['Money Received', 'Ticket Price']], 3.5)
outliers.value_counts()
# Citizens from Groups B and C are always removed using the IQR method!
df_train[~outliers].max()
# Further checking the box plots:
# Check the number of citizens that have less than 7.5 years of education to conclude if they might be outliers
len(df_train[df_train["Years of Education"]<7.5])
# Test if the variable Working Hours per week follows a normal distribution
stat, p = shapiro(df_train["Working Hours per week"])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# Interpretation
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
# Manual removal of outliers by checking the box plots
filters = (df_train['Money Received'] < 120000) & (df_train['Ticket Price'] < 4000)
df_train_out = df_train[filters]
target_out = target[filters]
print(round((1-len(df_train_out)/len(df_train))*100,2),'% observations would be removed')
We decided not to eliminate outliers on "Money Received": when observing the value most distant from the rest (122999), we found several people sharing that high value, all with relatively important roles and long working hours.
Also, we do not consider the amount of money itself (paid or received) as relevant, but rather whether the person received money or paid to go.
However, we will still try the manual removal later on.
df_train.isna().sum()
df_train1 = df_train.copy()
modes = non_metric.mode().loc[0]
df_train1.fillna(modes, inplace=True)
Using the following code cell, we will try to understand whether certain non-metric variables are dependent on, or independent of, the ones with missing values.
# Create a function that performs the Chi2 test of independence, to check for association between each variable with no
# missing values and each of the ones that have them
def check_association(col1, col2):
    # Contingency table
    tab = pd.crosstab(df_train[col1], df_train[col2], margins=False).values
    stat, p, dof, expected = chi2_contingency(tab)
    # Interpretation of the test statistic
    prob = 0.95
    critical = chi2.ppf(prob, dof)
    print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
    if abs(stat) >= critical:
        print(col1,'and',col2,'are dependent (reject H0).')
    else:
        print(col1,'and',col2,'are independent (fail to reject H0).')
    # Interpretation of the p-value
    alpha = 1.0 - prob
    print('significance=%.3f, p=%.3f' % (alpha, p))
    if p <= alpha:
        print(col1,'and',col2,'are dependent (reject H0).\n')
    else:
        print(col1,'and',col2,'are independent (fail to reject H0).\n')
# FIRST: Marital Status
check_association('Marital Status','Base Area')
check_association('Marital Status','Employment Sector')
check_association('Marital Status','Role')
# SECOND: Education Level
check_association('Education Level','Base Area')
check_association('Education Level','Employment Sector')
check_association('Education Level','Role')
Since both of these variables are associated with the three variables containing missing values, the similarities between people in these categories may help us infer their characteristics in the missing variables.
# Copy the dataframe into another to apply the changes there
df_train2 = df_train.copy()
# Create a function to impute the missing values by the mode of the records belonging to the same classes of
# Marital Status and Education Level
def impute_mode_by_cat(df_train2, col):
    exp = df_train2[df_train2[col].isnull()].reset_index()
    gr = df_train2.groupby(["Education Level","Marital Status"])[col].agg(pd.Series.mode)
    # If there are empty groups in the group by, replace them by the overall mode of the original variable
    for i in range(len(gr)):
        if len(gr[i])==0:
            gr[i] = df_train2[col].mode()[0]
    # Set the values of the column in the auxiliary dataset to the mode among the observations with the same
    # level of education and marital status, since those are, at least to some extent, the most similar citizens
    for i in range(len(exp)):
        for x in range(len(gr)):
            if (exp['Education Level'][i]==gr.index[x][0]) and (exp['Marital Status'][i]==gr.index[x][1]):
                exp[col][i] = gr[x]
    # If a group is multimodal (the mode is an array, not a string), fall back to the overall mode of the
    # original variable
    for i in range(len(exp)):
        if type(exp[col][i])!=str:
            exp[col][i] = df_train2[col].mode()[0]
    # Finally, replace the null values in the original dataset by the values acquired above
    for i in range(len(exp)):
        df_train2.loc[exp['CITIZEN_ID'][i], col] = exp[col][i]
    return df_train2
# Checking the records of citizens with a null Base Area
df_train2[df_train2['Base Area'].isnull()]
# Imputing them with the method described above
df_train2 = impute_mode_by_cat(df_train2,'Base Area')
# Checking the records of citizens with a null Employment Sector
df_train2[df_train2['Employment Sector'].isnull()]
# Imputing them with the method described above
df_train2 = impute_mode_by_cat(df_train2,'Employment Sector')
# Checking the records of citizens with a null Role
df_train2[df_train2['Role'].isnull()]
# Imputing them with the method described above
df_train2 = impute_mode_by_cat(df_train2,'Role')
# Check if there are no missing values left to impute
df_train2.isna().sum()
# Barplots for the non-metric variables before & after the imputation
sns.set_style("whitegrid")
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(25,25))
axa=sns.countplot(df_train2["Base Area"], ax=axes[0,1])
axb=sns.countplot(df_train2["Employment Sector"], ax=axes[1,1])
axc=sns.countplot(df_train2["Role"], ax=axes[2,1])
ax4=sns.countplot(non_metric["Base Area"], ax=axes[0,0])
ax5=sns.countplot(non_metric["Employment Sector"], ax=axes[1,0])
ax7=sns.countplot(non_metric["Role"], ax=axes[2, 0])
axa.tick_params(labelsize=15)
axa.set_xticklabels(axa.get_xticklabels(), rotation=90)
axa.set_xlabel(xlabel='Base Area',fontsize = 17)
axa.set_ylabel(ylabel='Nr of observations',fontsize = 17)
axa.set(ylim=(0, 21000))
ax4.tick_params(labelsize=15)
ax4.set_xticklabels(ax4.get_xticklabels(), rotation=90)
ax4.set_xlabel(xlabel='Base Area - with missings',fontsize = 17)
ax4.set_ylabel(ylabel='Nr of observations',fontsize = 17)
ax4.set(ylim=(0, 21000))
axb.tick_params(labelsize=15)
axb.set_xticklabels(axb.get_xticklabels(), rotation=90)
axb.set_xlabel(xlabel='Employment Sector',fontsize = 17)
axb.set_ylabel(ylabel='Nr of observations',fontsize = 17)
axb.set(ylim=(0, 17000))
ax5.tick_params(labelsize=15)
ax5.set_xticklabels(ax5.get_xticklabels(), rotation=90)
ax5.set_xlabel(xlabel='Employment Sector - with missings',fontsize = 17)
ax5.set_ylabel(ylabel='Nr of observations',fontsize = 17)
ax5.set(ylim=(0, 17000))
axc.tick_params(labelsize=15)
axc.set_xticklabels(axc.get_xticklabels(), rotation=90)
axc.set_xlabel(xlabel='Role',fontsize = 17)
axc.set_ylabel(ylabel='Nr of observations',fontsize = 17)
axc.set(ylim=(0, 3200))
ax7.tick_params(labelsize=15)
ax7.set_xticklabels(ax7.get_xticklabels(), rotation=90)
ax7.set_xlabel(xlabel='Role - with missings',fontsize = 17)
ax7.set_ylabel(ylabel='Nr of observations',fontsize = 17)
ax7.set(ylim=(0, 3200))
plt.subplots_adjust(top=0.9,hspace=0.9)
plt.suptitle("Distribution of variables with and without missing values", fontsize=30)
Age:
# New variable for Age at the start of the experiment (year 2048)
df_train2["Age"] = df_train2.Birthday.map(lambda x: 2048 - x.year)
# Same for test set
df_test["Age"] = df_test.Birthday.map(lambda x: 2048 - x.year)
df_train2[["Birthday","Age"]]
Gender:
# New variable for Gender - 1 if it's a male, 0 otherwise
df_train2['Male'] = np.where(df_train2.Name.str.contains('Mrs|Miss'), '0', '1')
# Same for test set
df_test['Male'] = np.where(df_test.Name.str.contains('Mrs|Miss'), '0', '1')
Marital Status:
# Check the values for variable Marital Status
df_train2["Marital Status"].unique()
# In Marital Status, "Married" will absorb 'Married - Spouse Missing' and 'Married - Spouse in the Army', since we do
# not consider this distinction relevant
df_train2["Marital Status_new"] = df_train2["Marital Status"]
df_train2["Marital Status_new"][(df_train2["Marital Status"].str.contains("Married")==True) & (df_train2["Marital Status"]!="Married")]="Married"
# In Marital Status, join 'Divorced' with 'Separated' in "Divorced or Separated"
df_train2["Marital Status_new"][(df_train2["Marital Status"]=="Separated") | (df_train2["Marital Status"]=="Divorced")]="Divorced or Separated"
# Same for test set
df_test["Marital Status_new"] = df_test["Marital Status"]
df_test["Marital Status_new"][(df_test["Marital Status"].str.contains("Married")==True) & (df_test["Marital Status"]!="Married")]="Married"
df_test["Marital Status_new"][(df_test["Marital Status"]=="Separated") | (df_test["Marital Status"]=="Divorced")]="Divorced or Separated"
df_train2["Marital Status_new"].unique()
Education Level:
# Check the values for variable Education Level
df_train2["Education Level"].unique()
# In Education Level, join all equal periods
df_train2["Education Level_new"] = df_train2["Education Level"]
df_train2["Education Level_new"][(df_train2["Education Level"]=="Middle School - 1st Cycle") |
(df_train2["Education Level"]=="Middle School - 2nd Cycle")|
(df_train2["Education Level"]=="Middle School Complete")]="Middle School"
df_train2["Education Level_new"][(df_train2["Education Level"]=="High School - 1st Cycle") |
(df_train2["Education Level"]=="High School - 2nd Cycle") |
(df_train2["Education Level"]=="High School Complete") | (df_train2["Education Level"]=="High School + PostGraduation")]="High School"
df_train2["Education Level_new"][(df_train2["Education Level"]=="Bachelors + PostGraduation")]="Bachelors"
df_train2["Education Level_new"][(df_train2["Education Level"]=="Professional School + PostGraduation")]="Professional School"
df_train2["Education Level_new"][(df_train2["Education Level"]=="Masters + PostGraduation")]="Masters"
# Same for test set
df_test["Education Level_new"] = df_test["Education Level"]
df_test["Education Level_new"][(df_test["Education Level"]=="Middle School - 1st Cycle") |
(df_test["Education Level"]=="Middle School - 2nd Cycle")|
(df_test["Education Level"]=="Middle School Complete")]="Middle School"
df_test["Education Level_new"][(df_test["Education Level"]=="High School - 1st Cycle") |
(df_test["Education Level"]=="High School - 2nd Cycle") |
(df_test["Education Level"]=="High School Complete") | (df_test["Education Level"]=="High School + PostGraduation")]="High School"
df_test["Education Level_new"][(df_test["Education Level"]=="Bachelors + PostGraduation")]="Bachelors"
df_test["Education Level_new"][(df_test["Education Level"]=="Professional School + PostGraduation")]="Professional School"
df_test["Education Level_new"][(df_test["Education Level"]=="Masters + PostGraduation")]="Masters"
df_train2["Education Level_new"].unique()
df_train2["Years of Education"][df_train2["Education Level_new"]=='High School']
# PostGraduation is a binary that says if the citizen has a Post Graduation or not
df_train2["PostGraduation"] = df_train2["Education Level"].map(lambda x: '1' if "+" in x else '0')
# Same for test set
df_test["PostGraduation"] = df_test["Education Level"].map(lambda x: '1' if "+" in x else '0')
# New binary variable related to Higher Education
# (includes at least one of the following: Post Graduation, Bachelors, Masters, PhD)
df_train2['Higher Education'] = np.where(df_train2['Years of Education']>12, '1', '0')
# Same for test set
df_test['Higher Education'] = np.where(df_test['Years of Education']>12, '1', '0')
Capital:
# New binary variable that tells us if each person lives in the capital city or not
# (after analysing and visualizing the data, we assume Northbury to be a kind of capital city/main base of the new planet)
df_train2['Capital'] = np.where(df_train2['Base Area']=='Northbury', '1', '0')
# Same for test set
df_test['Capital'] = np.where(df_test['Base Area']=='Northbury', '1', '0')
Groups on the mission:
# New binary variable to determine whether the person belongs to Group B
# (people who were paid to participate in the mission)
df_train2['Group B'] = np.where(df_train2['Money Received']!=0, '1', '0')
# Same for test set
df_test['Group B'] = np.where(df_test['Money Received']!=0, '1', '0')
# New binary variable to determine whether the person belongs to Group C
# (people who paid to participate in the mission)
df_train2['Group C']=np.where(df_train2['Ticket Price']!=0, '1', '0')
# Same for test set
df_test['Group C']=np.where(df_test['Ticket Price']!=0, '1', '0')
Employment Sector:
# Function to join similar employment sectors
def sectors(a):
    if 'Private Sector' in a:
        return 'Private Sector'
    elif 'Public Sector' in a:
        return 'Public Sector'
    # We don't merge the Self-Employed categories, because the 'Company' ones have many more 1s on the target
    # than the 'Individual' ones
    elif 'Self-Employed (Individual)' in a:
        return a
    elif 'Self-Employed (Company)' in a:
        return a
    else:
        return 'Unemployed / Never Worked'
# In Employment Sector, join all equal sectors
df_train2['Employment Sector (simplified)'] = df_train2['Employment Sector'].map(sectors)
# Same for test set
df_test['Employment Sector (simplified)'] = df_test['Employment Sector'].map(sectors)
# New binary variable to determine whether the person belongs to the Government
df_train2['Government'] = df_train2['Employment Sector'].map(lambda x: '1' if 'Government' in x else '0')
# Same for test set
df_test['Government'] = df_test['Employment Sector'].map(lambda x: '1' if 'Government' in x else '0')
Ordinal variable with Money Received and Ticket Price:
# Money Relevance orders the citizens by importance, according to how much money they received or paid; Group A,
# whose members neither received nor paid, is considered to be in between Groups B (received) and C (paid)
Median_Money_Received = df_train2["Money Received"][df_train2["Money Received"]>0].median()
Median_Ticket_Price = df_train2["Ticket Price"][df_train2["Ticket Price"]>0].median()
df_train2['Money Relevance']='0'
df_train2['Money Relevance'][df_train2["Money Received"]> Median_Money_Received]='1'
df_train2['Money Relevance'][(df_train2["Money Received"]<= Median_Money_Received) & (df_train2["Money Received"]>0)]='2'
df_train2['Money Relevance'][df_train2["Ticket Price"]> Median_Ticket_Price]='5'
df_train2['Money Relevance'][(df_train2["Ticket Price"]<= Median_Ticket_Price) & (df_train2["Ticket Price"]>0)]='4'
df_train2['Money Relevance'][(df_train2["Ticket Price"]== 0) & (df_train2["Money Received"]==0)]='3'
# Same for test set
Median_Money_Received = df_test["Money Received"][df_test["Money Received"]>0].median()
Median_Ticket_Price = df_test["Ticket Price"][df_test["Ticket Price"]>0].median()
df_test['Money Relevance']='0'
df_test['Money Relevance'][df_test["Money Received"]> Median_Money_Received]='1'
df_test['Money Relevance'][(df_test["Money Received"]<= Median_Money_Received) & (df_test["Money Received"]>0)]='2'
df_test['Money Relevance'][df_test["Ticket Price"]> Median_Ticket_Price]='5'
df_test['Money Relevance'][(df_test["Ticket Price"]<= Median_Ticket_Price) & (df_test["Ticket Price"]>0)]='4'
df_test['Money Relevance'][(df_test["Ticket Price"]== 0) & (df_test["Money Received"]==0)]='3'
df_train2[["Ticket Price", "Money Received", "Money Relevance"]]
Interaction between Working hours and Years of Education:
# Interaction between these two features: does working more hours have a bigger impact on income when combined with more years of education?
df_train2["Working hours * Years of Education"] = df_train2["Working Hours per week"] * df_train2["Years of Education"]
# Same for test set
df_test["Working hours * Years of Education"] = df_test["Working Hours per week"] * df_test["Years of Education"]
Money / Years of Education:
# Money received per year of education
df_train2['Money / YE'] = 0
df_train2['Money / YE'][df_train2["Years of Education"]!=0]=round(df_train2["Money Received"] / df_train2["Years of Education"], 2)
# Same for test set
df_test['Money / YE'] = 0
df_test['Money / YE'][df_test["Years of Education"]!=0]=round(df_test["Money Received"] / df_test["Years of Education"], 2)
Log 10 of Money Received and Ticket Price:
# Log 10 of Money Received and Ticket Price, to deal with the high values in those variables
df_train2['Log 10 of Money Received']=df_train2['Money Received'].map(lambda x: math.log10(x) if x!=0 else 0)
df_train2['Log 10 of Ticket Price']=df_train2['Ticket Price'].map(lambda x: math.log10(x) if x!=0 else 0)
# Same for test set
df_test['Log 10 of Money Received']=df_test['Money Received'].map(lambda x: math.log10(x) if x!=0 else 0)
df_test['Log 10 of Ticket Price']=df_test['Ticket Price'].map(lambda x: math.log10(x) if x!=0 else 0)
# Defining the dataframe of the initial categorical variables
initial_categorical_vars = df_train2.loc[:, np.array(df_train2.dtypes=="object")]
initial_categorical_vars.drop(columns='Name', inplace=True)
# Same for test
initial_categorical_vars_test = df_test.loc[:, np.array(df_test.dtypes=="object")]
initial_categorical_vars_test.drop(columns='Name', inplace=True)
initial_categorical_vars.head()
# Definition of a function to evaluate the capacity of each non-metric variable to distinguish between the 0s and 1s
# of the target variable
def bar_charts_categorical(df, feature, dep_var):
    cont_tab = pd.crosstab(df[feature], dep_var, margins=True)
    categories = cont_tab.index[:-1]
    fig = plt.figure(figsize=(15, 5))
    plt.subplot(121)
    p1 = plt.bar(categories, cont_tab.iloc[:-1, 0].values, 0.55, color="peru")
    p2 = plt.bar(categories, cont_tab.iloc[:-1, 1].values, 0.55, bottom=cont_tab.iloc[:-1, 0], color="b")
    plt.legend((p2[0], p1[0]), ('$y_i=1$', '$y_i=0$'))
    plt.title("Frequency bar chart")
    plt.xlabel(feature)
    plt.ylabel("$Frequency$")
    plt.xticks(rotation=90)
    # Auxiliary data
    obs_pct = np.array([np.divide(cont_tab.iloc[:-1, 0].values, cont_tab.iloc[:-1, 2].values),
                        np.divide(cont_tab.iloc[:-1, 1].values, cont_tab.iloc[:-1, 2].values)])
    plt.subplot(122)
    p1 = plt.bar(categories, obs_pct[0], 0.55, color="peru")
    p2 = plt.bar(categories, obs_pct[1], 0.55, bottom=obs_pct[0], color="b")
    plt.legend((p2[0], p1[0]), ('$y_i=1$', '$y_i=0$'))
    plt.title("Proportion bar chart")
    plt.xlabel(feature)
    plt.ylabel("$p$")
    plt.xticks(rotation=90)
    plt.show()
# Check the graphs for each categorical feature
initial_categorical_features = initial_categorical_vars.columns
for i in initial_categorical_features:
bar_charts_categorical(initial_categorical_vars, i, target)
After analyzing the previous graphs, we concluded that some of the original non-metric variables have many categories, and some are not relevant enough to keep, because they are redundant and add neither useful information nor interpretability. This is the case for Marital Status, Education Level and Employment Sector. We still kept the variables created during Feature Engineering that derive from these three.
Base Area was dropped because it had dozens of categories and only one actually had many citizens: Northbury. Thus, we decided to keep only the feature Capital.
Finally, Lives with was also dropped because it was too redundant with the Marital Status variables, and didn't seem to add any useful information in distinguishing 0s and 1s on the target.
# Drop the features referenced above
df_features = initial_categorical_vars.drop(columns=['Marital Status', 'Lives with', 'Base Area',
'Education Level', 'Employment Sector'])
# Same for test
df_features_test = initial_categorical_vars_test.drop(columns=['Marital Status', 'Lives with', 'Base Area',
'Education Level', 'Employment Sector'])
df_features.head()
Encoding the non-metric features:
pd.set_option('display.max_columns', None)
# Some variables are already binary and shouldn't be One Hot Encoded
df_ohc = df_features.drop(columns=['Male', 'Higher Education', 'Capital', 'Group B', 'Group C',
'PostGraduation','Government']).copy()
# Use OneHotEncoder to encode the non-metric features. Get feature names and create a DataFrame
# with the one-hot encoded non-metric features (pass feature names)
ohc = OneHotEncoder(sparse=False, dtype=int)
ohc_feat = ohc.fit_transform(df_ohc)
ohc_feat_names = ohc.get_feature_names()
ohc_df = pd.DataFrame(ohc_feat, index=df_ohc.index, columns=ohc_feat_names)
# Same for test
df_ohc_test = df_features_test.drop(columns=['Male', 'Higher Education', 'Capital', 'Group B', 'Group C',
'PostGraduation','Government']).copy()
# Reuse the encoder fitted on the train set, so the test set gets the same columns in the same order
ohc_feat_test = ohc.transform(df_ohc_test)
ohc_feat_names_test = ohc.get_feature_names()
ohc_df_test = pd.DataFrame(ohc_feat_test, index=df_ohc_test.index, columns=ohc_feat_names_test)
ohc_df
We will now assess, with a decision tree, the importance of all the binary features obtained from the encoding, to identify the classes with the lowest feature importance within each non-metric feature. We also checked the previous bar plots, at the beginning of the Feature Selection stage, to better visualize the frequency and proportion of each class.
This is done because using drop='first' on OneHotEncoder would drop several important classes (e.g. level 1 on Money Relevance), so we will manually drop one class per variable instead.
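As an alternative to dropping classes after the fact, `OneHotEncoder`'s `drop` parameter also accepts an array with one category per feature (sklearn >= 0.21). A sketch on a toy frame with hypothetical categories:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with two hypothetical categorical features
df = pd.DataFrame({"Continent": ["Europe", "Oceania", "Asia"],
                   "Role": ["Management", "No Role", "Professor"]})

# drop takes one category per feature, so the "worst" class of each
# variable can be dropped at encoding time instead of afterwards
ohc = OneHotEncoder(drop=["Oceania", "No Role"], dtype=int)
encoded = ohc.fit_transform(df).toarray()
print(encoded)
```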
# Feature importance using the split criteria 'Gini'
gini_importance = DecisionTreeClassifier().fit(ohc_df, target).feature_importances_
# Feature importance using the split criteria 'Entropy'
entropy_importance = DecisionTreeClassifier(criterion='entropy').fit(ohc_df, target).feature_importances_
# Plotting the feature importances for both criteria
zippy = pd.DataFrame(zip(gini_importance, entropy_importance), columns = ['gini','entropy'])
zippy['col'] = ohc_df.columns
tidy = zippy.melt(id_vars='col').rename(columns=str.title)
tidy.sort_values(['Value'], ascending = False, inplace = True)
plt.figure(figsize=(15,20))
sns.barplot(y='Col', x='Value', hue='Variable', data=tidy)
# Drop the "worst" class of each feature
ohc_df.drop(columns=['x0_Oceania', 'x1_No Role', 'x2_Widow', 'x3_No Relevant Education',
'x4_Unemployed / Never Worked', 'x5_4'], inplace=True)
# Reassigning df to contain ohc variables
non_metric_binary = pd.concat([df_features.drop(columns=df_ohc.columns), ohc_df], axis=1)
non_metric_binary.head()
# Change the 'object' binaries to 'int'
non_metric_binary.loc[:, np.array(non_metric_binary.dtypes=="object")] = non_metric_binary.loc[:, np.array(non_metric_binary.dtypes=="object")].astype(int)
# Feature importance using the split criteria 'Gini'
gini_importance = DecisionTreeClassifier().fit(non_metric_binary, target).feature_importances_
# Feature importance using the split criteria 'Entropy'
entropy_importance = DecisionTreeClassifier(criterion='entropy').fit(non_metric_binary, target).feature_importances_
# Plotting the feature importances for both criteria
zippy = pd.DataFrame(zip(gini_importance, entropy_importance), columns = ['gini','entropy'])
zippy['col'] = non_metric_binary.columns
tidy = zippy.melt(id_vars='col').rename(columns=str.title)
tidy.sort_values(['Value'], ascending = False, inplace = True)
plt.figure(figsize=(15,20))
sns.barplot(y='Col', x='Value', hue='Variable', data=tidy)
Ranking for this method:
# Random forest instance, indicating the number of trees
rf = RandomForestClassifier(n_estimators = 100, random_state=0, n_jobs=-1)
sel = SelectFromModel(rf)
# SelectFromModel object from sklearn to automatically select the features
sel.fit(non_metric_binary, target)
# Features with an importance greater than the mean importance of all the features
sel.get_support()
rf.fit(non_metric_binary, target)
# Ranking by feature importances
df_imp = pd.DataFrame(rf.feature_importances_, non_metric_binary.columns).reset_index().rename(columns={'index':'binary_variables', 0:'feature_importance'})
df_imp.sort_values('feature_importance', ascending=False)
# Get the selected features on a list and count them
selected_feat = non_metric_binary.columns[(sel.get_support())]
len(selected_feat)
# Feature's names (note: without any order of importance)
print(selected_feat)
Ranking for this method:
# Getting a new dataframe to implement this method
df_features_target = df_features.copy()
df_features_target["Target"] = target
df_features_target
df_features.head(3)
from sklearn.feature_selection import SelectKBest, chi2 # for chi-squared feature selection
sf = SelectKBest(chi2, k='all')
sf_fit = sf.fit(non_metric_binary, target)
# Plot the scores
datset = pd.DataFrame()
datset['feature'] = non_metric_binary.columns[range(len(sf_fit.scores_))]
datset['scores'] = sf_fit.scores_
datset = datset.sort_values(by='scores', ascending=False)
plt.figure(figsize=(10,15))
sns.barplot(x=datset['scores'], y=datset['feature'], color='peru')
sns.set_style('whitegrid')
plt.ylabel('Categorical Feature', fontsize=18)
plt.xlabel('Score', fontsize=18)
plt.show()
# When using this method, higher score values mean more relevance to explain the dependent variable
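To illustrate how the chi-squared scores behave, a tiny hypothetical example where one feature tracks the target exactly and the other is noise:

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Tiny illustration (hypothetical data): "a" tracks the target, "b" is noise
X = pd.DataFrame({"a": [1, 1, 1, 0, 0, 0], "b": [1, 0, 1, 0, 1, 0]})
y = [1, 1, 1, 0, 0, 0]

# Higher chi2 scores mean a stronger relationship with the target
scores, pvalues = chi2(X, y)
print(dict(zip(X.columns, scores.round(2))))
```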
Ranking for this method:
feat = []
mi = []
for i in non_metric_binary.columns:
feat.append(i)
a = np.array(non_metric_binary[i])
b = np.array(target)
# Mutual information between each feature and the target, in nats (ln 2 ~ 0.69 is the maximum for a binary pair)
mi.append(mutual_info_classif(a.reshape(-1,1), b, discrete_features = True)[0])
# Plot the MI
feat_mi=pd.DataFrame([feat, mi]).T.sort_values(by=1, ascending=False).reset_index(drop=True)
plt.figure(figsize=(10,15))
sns.barplot(x=1, y=0, data=feat_mi, color='peru')
sns.set_style('whitegrid')
plt.ylabel('Categorical Feature', fontsize=18)
plt.xlabel('Mutual Information', fontsize=18)
plt.show()
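`mutual_info_classif` also accepts the full feature matrix, so the per-column loop can be collapsed into a single call; a sketch on toy binary data (hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for non_metric_binary / target (hypothetical data)
X = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
y = pd.Series([0, 0, 1, 1])

# One call scores every column at once; discrete_features=True because
# the inputs are binaries, and values are expressed in nats
mi = pd.Series(mutual_info_classif(X, y, discrete_features=True, random_state=0),
               index=X.columns).sort_values(ascending=False)
print(mi)
```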
Ranking for this method:
# Selection based on the previously used methods
non_metric_selected = non_metric_binary[['Male', 'Higher Education', 'Group B', 'x1_Management', 'x1_Professor',
'x2_Married', 'x2_Single', 'x3_Bachelors', 'x3_Masters', 'x5_1', 'x5_3', 'x5_5']]
# Checking for redundant variables
print('Normalized mutual information between binary variables (0-1):\n')
for i in non_metric_selected.columns:
for j in non_metric_selected.columns:
normal_mi = round(normalized_mutual_info_score(non_metric_selected[i], non_metric_selected[j]), 3)
if i == j: #if equals to 1
pass
elif normal_mi > 0.5:
print(i, 'and', j, ':', normal_mi)
# x3_3 was selected more times than Group B, so we will keep x3_3
non_metric_selected = non_metric_selected.drop(columns='Group B')  # reassigning avoids a SettingWithCopyWarning
df_train2.info()
metric = df_train2.loc[:,(np.array(df_train2.dtypes=="int64")) | (np.array(df_train2.dtypes=="float64"))]
# Normalizing using min max
min_max_scaler = preprocessing.MinMaxScaler()
metric_scaled = min_max_scaler.fit_transform(metric.values)
stand_metric= pd.DataFrame(metric_scaled, columns=metric.columns, index=metric.index)
# # Normalizing using RobustScaler
# robust = RobustScaler().fit(metric)
# robust_metric= robust.transform(metric)
# stand_metric= pd.DataFrame(robust_metric, columns=metric.columns, index=metric.index)
# Start by checking correlations
sns.set(style="white")
# Compute the correlation matrix
corr = stand_metric.corr() #Getting correlation of numerical variables
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool) #Return an array of zeros (Falses) with the same shape and type as a given array
mask[np.triu_indices_from(mask)] = True #The upper-triangle array is now composed by True values
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(20, 12))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True) #Make a diverging palette between two HUSL colors. Return a matplotlib colormap object.
# Draw the heatmap with the mask and correct aspect ratio
#show only correlations bigger than 0.7 in absolute value
sns.heatmap(corr[(corr>=.7) | (corr<=-.7)], mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, ax=ax)
# Layout
plt.subplots_adjust(top=0.95)
plt.suptitle("Correlation matrix", fontsize=20)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
# Fixing the bug of partially cut-off bottom and top cells
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
# Correlation between Money Received and Log 10 of Money Received
round(corr['Money Received']['Log 10 of Money Received'], 3)
# No of features
nof_list=np.arange(1,len(stand_metric.columns)+1)
high_score=0
# Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
X_train, X_test, y_train, y_test = train_test_split(stand_metric,target, test_size = 0.3, random_state = 0)
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=nof_list[n])
X_train_rfe = rfe.fit_transform(X_train,y_train)
X_test_rfe = rfe.transform(X_test)
model.fit(X_train_rfe,y_train)
score = model.score(X_test_rfe,y_test)
score_list.append(score)
if(score>high_score):
high_score = score
nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
rfe = RFE(estimator = model, n_features_to_select = 7)
X_rfe = rfe.fit_transform(X = stand_metric, y = target)
model = LogisticRegression().fit(X = X_rfe,y = target)
selected_features = pd.Series(rfe.support_, index = stand_metric.columns)
# Features selected with RFE
selected_features
def plot_importance(coef,name):
imp_coef = coef.sort_values()
plt.figure(figsize=(8,10))
imp_coef.plot(kind = "barh", color="peru")
plt.title("Feature importance using " + name + " Model")
plt.show()
reg = LassoCV()
reg.fit(X=stand_metric, y=target)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X = stand_metric,y = target))
coef = pd.Series(reg.coef_, index = stand_metric.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
plot_importance(coef,'Lasso')
ridge = RidgeClassifierCV().fit(X = stand_metric,y = target)
coef_ridge = pd.Series(ridge.coef_[0], index = stand_metric.columns)
# plot_importance was already defined above, so we reuse it here
plot_importance(coef_ridge,'RidgeClassifier')
model = LogisticRegression()
# Stop when all features have been selected, scoring is "accuracy"
forward = SFS(model, k_features=9, forward=True, scoring="accuracy", cv = None)
forward.fit(stand_metric, target)
# Checking the features added at each step
forward_table = pd.DataFrame.from_dict(forward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
forward_table
# Iteration with the highest accuracy
forward_table_max = forward_table['avg_score'].max()
forward_table_max
# Feature's names
forward_table[forward_table['avg_score']==forward_table_max]['feature_names'].values
# Stop when only one feature remains, scoring is "accuracy"
backward = SFS(model, k_features=1, forward=False, scoring="accuracy", cv = None)
backward.fit(stand_metric, target)
# Checking the features removed at each step
backward_table = pd.DataFrame.from_dict(backward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
backward_table
# Iteration with the highest accuracy
backward_table_max = backward_table['avg_score'].max()
backward_table_max
# Feature's names
backward_table[backward_table['avg_score']==backward_table_max]['feature_names'].values
# Drop the metric features that should not be selected
stand_metric.drop(columns=['Working Hours per week', 'Money / YE', 'Log 10 of Money Received',
'Log 10 of Ticket Price'], inplace=True)
# Dataframe with all features of all types
all_selected_variables = pd.concat([non_metric_selected, stand_metric], axis=1)
all_selected_variables.head()
Now we will repeat the forward and backward selection, this time with metric and non-metric variables mixed.
Forward:
# The model was defined above (LogisticRegression)
forward = SFS(model, k_features=16, forward=True, scoring="accuracy", cv = None)
forward.fit(all_selected_variables, target)
forward_table = pd.DataFrame.from_dict(forward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
forward_table
forward_table_max = forward_table['avg_score'].max()
forward_table_max
forward_table[forward_table['avg_score']==forward_table_max]['feature_names'].values
Backward:
backward = SFS(model, k_features=1, forward=False, scoring="accuracy", cv = None) #floating=False
backward.fit(all_selected_variables, target)
backward_table = pd.DataFrame.from_dict(backward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
backward_table
backward_table_max = backward_table['avg_score'].max()
backward_table_max
backward_table[backward_table['avg_score']==backward_table_max]['feature_names'].values
Keeping the variables that appear in both the forward and backward selections:
non_metric_bf = non_metric_selected.drop(columns=['Higher Education', 'x3_Bachelors', 'x3_Masters', 'x5_5'])
all_selected_variables.drop(columns=['Higher Education', 'x3_Bachelors', 'x3_Masters', 'x5_5'], inplace=True)
# This is done to answer the question: is there redundancy between any metric and non-metric features?
print('Point biserial between binary and metric variables:\n')
for i in non_metric_bf.columns:
for j in stand_metric.columns:
pb = pointbiserialr(non_metric_bf[i], stand_metric[j])
if abs(pb[0]) > 0.5:
print(i, 'and', j, ':', round(pb[0], 3))
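To illustrate what the point-biserial check detects, a small synthetic example (hypothetical data) where one metric variable is driven by a binary flag and another is pure noise:

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.RandomState(0)
flag = rng.randint(0, 2, 200)                # binary feature
related = flag * 2.0 + rng.normal(size=200)  # metric variable driven by the flag
noise = rng.normal(size=200)                 # unrelated metric variable

# pointbiserialr returns (correlation, p-value)
r_corr, _ = pointbiserialr(flag, related)
r_noise, _ = pointbiserialr(flag, noise)
print(round(r_corr, 2), round(r_noise, 2))
```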
all_selected_variables.columns
# Selecting the same columns for test set
all_variables_test = pd.concat([df_test, ohc_df_test], axis=1)
test = all_variables_test[['Male', 'x1_Management', 'x1_Professor', 'x2_Married', 'x2_Single',
'x5_1', 'x5_3', 'Years of Education', 'Money Received', 'Ticket Price',
'Age', 'Working hours * Years of Education']]
X_train, X_val, y_train, y_val = train_test_split(all_selected_variables,
target,
test_size = 0.3,
random_state = 42,
shuffle=True,
stratify=target)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
import time
from sklearn.tree import export_graphviz
import graphviz
import pydotplus
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from collections import OrderedDict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from numpy import mean
from numpy import std
from numpy import arange
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerLine2D
from sklearn.svm import SVC
# Functions to be used in all models, to assess them
def metrics(y_train, pred_train , y_val, pred_val):
print('_____________________________________')
print(' TRAIN ')
print('-----------------------------------------------------------------------------------------------------------')
print(classification_report(y_train, pred_train))
print(confusion_matrix(y_train, pred_train)) #true neg and true pos, false positives and false neg
print('_____________________________________')
print(' VALIDATION ')
print('-----------------------------------------------------------------------------------------------------------')
print(classification_report(y_val, pred_val))
print(confusion_matrix(y_val, pred_val))
def avg_score(model):
# apply kfold
kf = KFold(n_splits=10)
# create lists to store the results from the different models
score_train = []
score_val = []
timer = []
n_iter = []
for train_index, val_index in kf.split(all_selected_variables):
# get the indexes of the observations assigned for each partition
X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
y_train, y_val = target.iloc[train_index], target.iloc[val_index]
# start counting time
begin = time.perf_counter()
# fit the model to the data
model.fit(X_train, y_train)
# finish counting time
end = time.perf_counter()
# check the mean accuracy for the train
value_train = model.score(X_train, y_train)
# check the mean accuracy for the test
value_val = model.score(X_val,y_val)
# append the accuracies, the time and the number of iterations in the corresponding list
score_train.append(value_train)
score_val.append(value_val)
timer.append(end-begin)
n_iter.append(model.n_iter_)
# calculate the average and the std for each measure (accuracy, time and number of iterations)
avg_time = round(np.mean(timer),3)
avg_train = round(np.mean(score_train),3)
avg_val = round(np.mean(score_val),3)
std_time = round(np.std(timer),2)
std_train = round(np.std(score_train),2)
std_val = round(np.std(score_val),2)
avg_iter = round(np.mean(n_iter),1)
std_iter = round(np.std(n_iter),1)
return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_val) + '+/-' + str(std_val), str(avg_iter) + '+/-' + str(std_iter)
def show_results(df, *args):
"""
Receive an empty dataframe and the different models and call the function avg_score
"""
count = 0
# for each model passed as argument
for arg in args:
# obtain the results provided by avg_score
time, avg_train, avg_val, avg_iter = avg_score(arg)
# store the results in the right row
df.iloc[count] = time, avg_train, avg_val, avg_iter
count+=1
return df
# For the models that don't have the n_iter attribute
def avg_score_1(model):
# apply kfold
kf = KFold(n_splits=10)
# create lists to store the results from the different models
score_train = []
score_val = []
timer = []
n_iter = []
for train_index, val_index in kf.split(all_selected_variables):
# get the indexes of the observations assigned for each partition
X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
y_train, y_val = target.iloc[train_index], target.iloc[val_index]
# start counting time
begin = time.perf_counter()
# fit the model to the data
model.fit(X_train, y_train)
# finish counting time
end = time.perf_counter()
# check the mean accuracy for the train
value_train = model.score(X_train, y_train)
# check the mean accuracy for the validation
value_val = model.score(X_val,y_val)
# append the accuracies, the time and the number of iterations in the corresponding list
score_train.append(value_train)
score_val.append(value_val)
timer.append(end-begin)
#n_iter.append(model.n_iter_)
# calculate the average and the std for each measure (accuracy, time and number of iterations)
avg_time = round(np.mean(timer),3)
avg_train = round(np.mean(score_train),3)
avg_val = round(np.mean(score_val),3)
std_time = round(np.std(timer),2)
std_train = round(np.std(score_train),2)
std_val = round(np.std(score_val),2)
#avg_iter = round(np.mean(n_iter),1)
#std_iter = round(np.std(n_iter),1)
return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_val) + '+/-' + str(std_val)
#, str(avg_iter) + '+/-' + str(std_iter)
def show_results_1(df, *args):
"""
Receive an empty dataframe and the different models and call the function avg_score
"""
count = 0
# for each model passed as argument
for arg in args:
# obtain the results provided by avg_score
time, avg_train, avg_val = avg_score_1(arg)
# store the results in the right row
df.iloc[count] = time, avg_train, avg_val
count+=1
return df
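Since `avg_score` and `avg_score_1` differ only in whether they record `n_iter_`, the two could be merged into one helper that branches on `hasattr`; a self-contained sketch on toy data (the function name `avg_score_any` and the demo data are hypothetical):

```python
import time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def avg_score_any(model, X, y, n_splits=5):
    """KFold accuracy summary; reports n_iter_ only when the model exposes it."""
    kf = KFold(n_splits=n_splits)
    score_train, score_val, timer, n_iter = [], [], [], []
    for tr, va in kf.split(X):
        begin = time.perf_counter()
        model.fit(X.iloc[tr], y.iloc[tr])
        timer.append(time.perf_counter() - begin)
        score_train.append(model.score(X.iloc[tr], y.iloc[tr]))
        score_val.append(model.score(X.iloc[va], y.iloc[va]))
        if hasattr(model, "n_iter_"):  # only iterative models have this attribute
            n_iter.append(np.mean(model.n_iter_))
    out = {"time": round(np.mean(timer), 3),
           "train": round(np.mean(score_train), 3),
           "val": round(np.mean(score_val), 3)}
    if n_iter:
        out["iter"] = round(np.mean(n_iter), 1)
    return out

# Toy demo (hypothetical data): one model with n_iter_, one without
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 3), columns=list("abc"))
y = pd.Series((X["a"] + X["b"] > 1).astype(int))
print(avg_score_any(LogisticRegression(), X, y))
print(avg_score_any(DecisionTreeClassifier(random_state=0), X, y))
```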
# Function to plot Decision Trees
def plot_tree(model_tree):
dot_data = export_graphviz(model_tree,
feature_names=X_train.columns,
class_names=["Income lower or equal to avg", "Income higher than avg"],
filled=True)
pydot_graph = pydotplus.graph_from_dot_data(dot_data)
pydot_graph.set_size('"20,20"')
return graphviz.Source(pydot_graph.to_string())
# Function to calculate AUC for each parameter option defined below (max_depth, max_features, min_samples_split, etc)
def calculate_AUC(interval, x_train, x_val, y_train, y_val, parameter, max_depth = None):
train_results = []
val_results = []
for value in interval:
if (parameter == 'max_depth'):
dt = DecisionTreeClassifier(max_depth = value, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'max_features'):
dt = DecisionTreeClassifier(max_features = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_samples_split'):
dt = DecisionTreeClassifier(min_samples_split = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_samples_leaf'):
dt = DecisionTreeClassifier(min_samples_leaf = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_weight_fraction_leaf'):
dt = DecisionTreeClassifier(min_weight_fraction_leaf = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_impurity_decrease'):
dt = DecisionTreeClassifier(min_impurity_decrease = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
train_pred = dt.predict(x_train)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Add auc score to previous train results
train_results.append(roc_auc)
y_pred = dt.predict(x_val)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Add auc score to previous validation results
val_results.append(roc_auc)
value_train = train_results.index(max(train_results))
value_val = val_results.index(max(val_results))
print('The best train value is ',interval[value_train])
print('The best validation value is ',interval[value_val])
line1, = plt.plot(interval, train_results, 'b', label="Train AUC")
line2, = plt.plot(interval, val_results, 'r', label="Validation AUC")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("AUC score")
plt.xlabel(str(parameter))
plt.show()
Note: decision-tree hyperparameters rarely improve performance by themselves; they are mainly meant to control overfitting.
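The parameters below are tuned one at a time; interactions between them can additionally be explored with `GridSearchCV`, which searches combinations jointly. A sketch on toy data (the data, grid values and variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for X_train / y_train (hypothetical data)
rng = np.random.RandomState(42)
X = pd.DataFrame(rng.rand(200, 4))
y = pd.Series((X[0] + X[1] > 1).astype(int))

# Joint search over two of the parameters tuned individually below
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid={"max_depth": [4, 6, 10],
                                "min_samples_split": [2, 100, 350]},
                    scoring="accuracy", cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```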
dt_entropy = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train)
dt_gini = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Gini','Entropy'])
show_results_1(df,dt_gini, dt_entropy)
dt_random = DecisionTreeClassifier(splitter = 'random').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['best','random'])
show_results_1(df,dt_gini, dt_random)
We will now search each parameter over an interval of values, to see at what point the AUC score reaches its maximum on the validation set.
# First, check max_depth
max_depths = np.linspace(1, 15, 15, endpoint=True)
calculate_AUC(max_depths, X_train, X_val, y_train, y_val, 'max_depth')
dt_depth10 = DecisionTreeClassifier(max_depth = 10).fit(X_train, y_train)
dt_depth6 = DecisionTreeClassifier(max_depth = 6).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['full','depth10','depth6'])
show_results_1(df,dt_gini, dt_depth10,dt_depth6)
# With more depth comes more overfitting
max_features = list(range(1,len(X_train.columns)))
calculate_AUC(max_features, X_train, X_val, y_train, y_val,'max_features', 10)
min_samples_split = list(range(10,600))
calculate_AUC(min_samples_split, X_train, X_val, y_train, y_val,'min_samples_split', 10)
dt_min17 = DecisionTreeClassifier(min_samples_split = 17).fit(X_train, y_train)
dt_min100 = DecisionTreeClassifier(min_samples_split = 100).fit(X_train, y_train)
dt_min350 = DecisionTreeClassifier(min_samples_split = 350).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['dt_min17','dt_min100','dt_min350'])
show_results_1(df, dt_min17, dt_min100, dt_min350)
# Smaller min_samples_split values overfit more; 350 indeed shows the least overfitting
min_samples_leaf = list(range(10,600))
calculate_AUC(min_samples_leaf, X_train, X_val, y_train, y_val,'min_samples_leaf', 10)
dt_min_leaf38 = DecisionTreeClassifier(min_samples_leaf = 38).fit(X_train, y_train)
dt_min_leaf220 = DecisionTreeClassifier(min_samples_leaf = 220).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min leaf 38','Min leaf 220'])
show_results_1(df, dt_gini, dt_min_leaf38, dt_min_leaf220)
# More useful for imbalanced datasets!
min_weight_fraction_leaf = np.linspace(0, 0.3, 250, endpoint=True)
calculate_AUC(min_weight_fraction_leaf, X_train, X_val, y_train, y_val,'min_weight_fraction_leaf', 10)
dt_min_weight_1 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.002).fit(X_train, y_train)
dt_min_weight_2 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.05).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min weight small','Min weight med'])
show_results_1(df, dt_gini, dt_min_weight_1, dt_min_weight_2)
min_impurity_decrease = np.linspace(0, 0.05, 500, endpoint=True)
calculate_AUC(min_impurity_decrease, X_train, X_val, y_train, y_val,'min_impurity_decrease', 10)
dt_impurity01 = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_train, y_train)
dt_impurity0001 = DecisionTreeClassifier(min_impurity_decrease=0.0001).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Baseline','dt_impurity01','dt_impurity0001'])
show_results_1(df,dt_gini, dt_impurity01,dt_impurity0001)
Now, we will check which is the best ccp_alpha value.
dt_alpha = DecisionTreeClassifier(random_state=42)
path = dt_alpha.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots(figsize = (10,10))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha", fontsize=15)
ax.set_ylabel("total impurity of leaves", fontsize=15)
ax.set_title("Total Impurity vs effective alpha for training set", fontsize=15)
# The function below only accepts values higher than 0
ccp_alphas=ccp_alphas[ccp_alphas>0]
trees = []
for ccp_alpha in ccp_alphas:
dt_alpha = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha).fit(X_train, y_train)
trees.append(dt_alpha)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(trees[-1].tree_.node_count, ccp_alphas[-1]))
trees = trees[:-1]
ccp_alphas = ccp_alphas[:-1]
train_scores = [tree.score(X_train, y_train) for tree in trees]
val_scores = [tree.score(X_val, y_val) for tree in trees]
fig, ax = plt.subplots(figsize = (10,10))
ax.set_xlabel("alpha", fontsize=15)
ax.set_ylabel("accuracy", fontsize=15)
ax.set_title("Accuracy vs alpha for training and validation sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, val_scores, marker='o', label="validation", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(val_scores)
best_model = trees[index_best_model]
print('ccp_alpha of best model: ', ccp_alphas[index_best_model])
print('_____________________________________________________________')
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Validation accuracy of best model: ',best_model.score(X_val, y_val))
The following cell contains four decision trees with different combinations of parameters.
dt_t1=DecisionTreeClassifier(min_impurity_decrease=0.0001,max_depth = 6,min_samples_split = 350,
min_weight_fraction_leaf = 0.002,random_state=42).fit(X_train, y_train)
dt_t2=DecisionTreeClassifier(max_depth = 6,min_weight_fraction_leaf = 0.002,random_state=42).fit(X_train, y_train)
dt_t3=DecisionTreeClassifier(min_samples_split = 350,min_weight_fraction_leaf = 0.002,
random_state=42).fit(X_train, y_train)
dt_t4=DecisionTreeClassifier(max_depth = 6,min_samples_split = 350,
min_weight_fraction_leaf = 0.002,random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t1.score(X_train, y_train))
print('Validation accuracy:',dt_t1.score(X_val, y_val))
print('Train accuracy:',dt_t2.score(X_train, y_train))
print('Validation accuracy:',dt_t2.score(X_val, y_val))
print('Train accuracy:',dt_t3.score(X_train, y_train))
print('Validation accuracy:',dt_t3.score(X_val, y_val))
print('Train accuracy:',dt_t4.score(X_train, y_train))
print('Validation accuracy:',dt_t4.score(X_val, y_val))
# Also creating the tree with the best ccp_alpha
dt_t5=DecisionTreeClassifier(ccp_alpha=0.000159, random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t5.score(X_train, y_train))
print('Validation accuracy:',dt_t5.score(X_val, y_val))
# Check: does changing the threshold improve, or not, the accuracy?
threshold = 0.4
predicted_proba = dt_t5.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
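Rather than checking a single threshold, a grid of thresholds can be swept to see whether any cut-off beats the default 0.5. Since `dt_t5` and `X_val` live in the notebook, this sketch uses a self-contained toy classifier (hypothetical data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for dt_t5 / X_val / y_val (hypothetical data)
X, y = make_classification(n_samples=400, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

proba = clf.predict_proba(X_va)[:, 1]
# Accuracy at each candidate threshold; argmax gives the best cut-off
thresholds = np.linspace(0.1, 0.9, 17)
accs = [accuracy_score(y_va, (proba >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(accs))]
print("best threshold:", round(best, 2), "accuracy:", round(max(accs), 3))
```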
# To build the ROC curves
prob_model1 = dt_t1.predict_proba(X_val)
prob_model2 = dt_t2.predict_proba(X_val)
prob_model3 = dt_t3.predict_proba(X_val)
prob_model4 = dt_t4.predict_proba(X_val)
prob_model5 = dt_t5.predict_proba(X_val)
fpr_1, tpr_1, thresholds_1 = roc_curve(y_val, prob_model1[:, 1])
fpr_2, tpr_2, thresholds_2 = roc_curve(y_val, prob_model2[:, 1])
fpr_3, tpr_3, thresholds_3 = roc_curve(y_val, prob_model3[:, 1])
fpr_4, tpr_4, thresholds_4 = roc_curve(y_val, prob_model4[:, 1])
fpr_5, tpr_5, thresholds_5 = roc_curve(y_val, prob_model5[:, 1])
plt.plot(fpr_1, tpr_1, label="ROC Curve model 1")
plt.plot(fpr_2, tpr_2, label="ROC Curve model 2")
plt.plot(fpr_3, tpr_3, label="ROC Curve model 3")
plt.plot(fpr_4, tpr_4, label="ROC Curve model 4")
plt.plot(fpr_5, tpr_5, label="ROC Curve model 5")
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
The best one is decision tree 5 (dt_t5).
labels_train = dt_t5.predict(X_train)
labels_val = dt_t5.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Check the complexity of the "best" tree
print('The "best" tree has a depth of ' + str(dt_t5.get_depth()) + ', ' + str(dt_t5.tree_.node_count) +
' nodes and a total of ' + str(dt_t5.get_n_leaves()) + ' leaves.')
First, plot the OOB error rate against n_estimators, to find the number of trees a random forest should have to minimize the out-of-bag error.
ensemble_clfs = [
("RandomForestClassifier, max_features='auto'",
RandomForestClassifier(oob_score=True,
max_features='auto',
random_state=42)),
("RandomForestClassifier, max_features='log2'",
RandomForestClassifier(max_features='log2',
oob_score=True,
random_state=42)),
("RandomForestClassifier, max_features=6",
RandomForestClassifier(max_features=6,
oob_score=True,
random_state=42)),
("RandomForestClassifier, max_features=None",
RandomForestClassifier(max_features=None,
oob_score=True,
random_state=42))
]
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of n_estimators values to explore
min_estimators = 15
max_estimators = 175 #225
for label, clf in ensemble_clfs:
for i in range(min_estimators, max_estimators + 1):
clf.set_params(n_estimators=i)
clf.fit(X_train, y_train)
# Record the OOB error for each n_estimators=i setting
oob_error = 1 - clf.oob_score_
error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs "n_estimators" plot
for label, clf_err in error_rate.items():
xs, ys = zip(*clf_err)
plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
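Rather than reading the optimum off the plot, the minimizing n_estimators can be extracted from the recorded pairs directly. A minimal sketch with hypothetical error values (each entry of the real error_rate dict has the same (n, error) shape):

```python
# Hypothetical (n_estimators, oob_error) pairs, shaped like one entry of error_rate
pairs = [(50, 0.145), (80, 0.141), (110, 0.138), (140, 0.139), (175, 0.140)]

# n_estimators with the lowest out-of-bag error
best_n, best_err = min(pairs, key=lambda p: p[1])
print(best_n, best_err)  # 110 0.138
```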
Using the n_estimators obtained from the previous graph (110), and the other parameters from the Decision Trees section:
rf_1 = RandomForestClassifier(min_samples_split = 350, min_weight_fraction_leaf = 0.002,
random_state=42).fit(X_train, y_train)
rf_2 = RandomForestClassifier(ccp_alpha=0.000159, random_state=42).fit(X_train, y_train)
rf_3 = RandomForestClassifier(max_depth = 6, min_weight_fraction_leaf = 0.002, random_state=42).fit(X_train, y_train)
rf_4 = RandomForestClassifier(n_estimators=110, max_depth=6, random_state = 42).fit(X_train, y_train)
rf_5 = RandomForestClassifier(n_estimators=110, max_depth=6, max_features = 6, random_state = 42).fit(X_train, y_train)
print('Train accuracy:',rf_1.score(X_train, y_train))
print('Validation accuracy:',rf_1.score(X_val, y_val))
print('Train accuracy:',rf_2.score(X_train, y_train))
print('Validation accuracy:',rf_2.score(X_val, y_val))
print('Train accuracy:',rf_3.score(X_train, y_train))
print('Validation accuracy:',rf_3.score(X_val, y_val))
print('Train accuracy:',rf_4.score(X_train, y_train))
print('Validation accuracy:',rf_4.score(X_val, y_val))
print('Train accuracy:',rf_5.score(X_train, y_train))
print('Validation accuracy:',rf_5.score(X_val, y_val))
# Plot the models' accuracies
models = ['rf_1', 'rf_2', 'rf_3','rf_4','rf_5']
accuracies = [rf_1.score(X_val, y_val), rf_2.score(X_val, y_val), rf_3.score(X_val, y_val),
rf_4.score(X_val, y_val),rf_5.score(X_val, y_val)]
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
plt.bar(data[0], data[1], color='peru')
plt.ylim(0.84, 0.87)
plt.show()
The best one is random forest 2 (rf_2).
labels_train = rf_2.predict(X_train)
labels_val = rf_2.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Changing the threshold does not seem to improve the accuracy of the best random forest!
threshold = 0.4
predicted_proba = rf_2.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
# Also check f1-score micro
f1_score(y_val, labels_val, average='micro')
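As a side note, for single-label classification micro-averaged F1 is mathematically equal to plain accuracy, so this serves as a consistency check rather than an independent metric. A quick self-contained illustration:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# Micro F1 aggregates TP/FP/FN over all classes, which for
# single-label predictions collapses to the accuracy
acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average='micro')
print(acc, micro)  # both equal 4/6
```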
# Defining the model
log_model = LogisticRegression(random_state=4)
# Fit model to our train data
log_model.fit(X_train,y_train)
# Predict class labels for samples in X_train
labels_train = log_model.predict(X_train)
log_model.score(X_train, y_train)
# Predict class labels for samples in X_val
labels_val = log_model.predict(X_val)
log_model.score(X_val, y_val)
pred_prob = log_model.predict_proba(X_val)
pred_prob
X_train.columns
log_model.coef_
# Logistic regression has no residuals in the OLS sense, so OLS-style diagnostics do not apply here
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, log_model)
# Check the metrics on the logistic regression
metrics(y_train, labels_train, y_val, labels_val)
# Initialize the model
modelNB = GaussianNB(var_smoothing=0.001)
# Fit it to the train data
modelNB.fit(X = X_train, y = y_train)
# Make the predictions
labels_train = modelNB.predict(X_train)
labels_val = modelNB.predict(X_val)
modelNB.predict_proba(X_val)
print("train score:", modelNB.score(X_train, y_train))
print("validation score:",modelNB.score(X_val, y_val))
print(modelNB.class_prior_)
print(modelNB.class_count_)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelNB)
# Check metrics on GNB
metrics(y_train, labels_train, y_val, labels_val)
model = MLPClassifier(random_state=4)
model.fit(X_train, y_train)
labels_train = model.predict(X_train)
labels_val = model.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
labels_val
losses = model.loss_curve_
iterations = range(model.n_iter_)
sns.lineplot(x=iterations, y=losses)
model.loss_
model = MLPClassifier(random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model)
model_1 = MLPClassifier(hidden_layer_sizes=(1),random_state=4)
model_2 = MLPClassifier(hidden_layer_sizes=(3),random_state=4)
model_3 = MLPClassifier(hidden_layer_sizes=(9),random_state=4)
model_4 = MLPClassifier(hidden_layer_sizes=(3, 3),random_state=4)
model_5 = MLPClassifier(hidden_layer_sizes=(5, 5),random_state=4)
model_6 = MLPClassifier(hidden_layer_sizes=(3, 3, 3),random_state=4) # 3 layers each one with 3 units
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_1','M_2','M_3', 'M_4','M_5','M_6'])
show_results(df, model_1, model_2, model_3, model_4, model_5, model_6)
model_7 = MLPClassifier(hidden_layer_sizes=(4, 4),random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_7'])
show_results(df, model_7)
model_logistic = MLPClassifier(activation = 'logistic',random_state=4)
model_tanh = MLPClassifier(activation = 'tanh',random_state=4)
model_relu = MLPClassifier(activation = 'relu',random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['logistic','tanh','relu'])
show_results(df, model_logistic, model_tanh,model_relu)
Logistic is better: same score in fewer iterations.
The logistic activation also produces a normalized output between 0 and 1.
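The boundedness of the logistic activation follows directly from the sigmoid definition; a tiny self-contained sketch:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    assert 0.0 < sigmoid(z) < 1.0
print(sigmoid(0.0))  # 0.5
```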
model_lbfgs = MLPClassifier(solver = 'lbfgs',random_state=4) # Low dim and sparse data
model_sgd = MLPClassifier(solver = 'sgd',random_state=4) # Accuracy > processing time
model_adam = MLPClassifier(solver = 'adam',random_state=4) # Big dataset but might fail to converge
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['lbfgs','sgd','adam'])
show_results(df, model_lbfgs, model_sgd, model_adam)
model_constant = MLPClassifier(solver = 'sgd', learning_rate = 'constant',random_state=4)
model_invscaling = MLPClassifier(solver = 'sgd', learning_rate = 'invscaling',random_state=4)
model_adaptive = MLPClassifier(solver = 'sgd', learning_rate = 'adaptive',random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['constant','invscaling','adaptive'])
show_results(df, model_constant, model_invscaling, model_adaptive)
Constant is the best
model_a = MLPClassifier(solver = 'adam', learning_rate_init = 0.5,random_state=4)
model_b = MLPClassifier(solver = 'adam', learning_rate_init = 0.1,random_state=4)
model_c = MLPClassifier(solver = 'adam', learning_rate_init = 0.01,random_state=4)
model_d = MLPClassifier(solver = 'adam', learning_rate_init = 0.001,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_a','M_b','M_c', "M_d"])
show_results(df, model_a, model_b, model_c, model_d)
The best results come from the smaller learning rates (around 0.01), so we'll also test an intermediate value.
model_e = MLPClassifier(solver = 'adam', learning_rate_init = 0.005,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ["M_e"])
show_results(df, model_e)
model_batch20 = MLPClassifier(solver = 'sgd', batch_size = 20,random_state=4)
model_batch50 = MLPClassifier(solver = 'sgd', batch_size = 50,random_state=4)
model_batch100 = MLPClassifier(solver = 'sgd', batch_size = 100,random_state=4)
model_batch200 = MLPClassifier(solver = 'sgd', batch_size = 200,random_state=4)
model_batch500 = MLPClassifier(solver = 'sgd', batch_size = 500,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['batch 20','batch 50','batch 100', 'batch 200', 'batch 500'])
show_results(df, model_batch20, model_batch50, model_batch100, model_batch200, model_batch500)
The best one is batch 50
model_maxiter_50 = MLPClassifier(max_iter = 50,random_state=4)
model_maxiter_100 = MLPClassifier(max_iter = 100,random_state=4)
model_maxiter_200 = MLPClassifier(max_iter = 200,random_state=4)
model_maxiter_300 = MLPClassifier(max_iter = 300,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 50','max iter 100','max iter 200', 'max iter 300'])
show_results(df, model_maxiter_50, model_maxiter_100, model_maxiter_200, model_maxiter_300)
model_maxiter_150 = MLPClassifier(max_iter = 150,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 150'])
show_results(df, model_maxiter_150)
model_all=MLPClassifier(hidden_layer_sizes=(9),activation = 'logistic',solver = 'adam',learning_rate_init = 0.1,batch_size = 50,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model_all)
model_grid=MLPClassifier(activation= 'logistic', batch_size= 100, hidden_layer_sizes=(9), learning_rate_init= 0.02102040816326531, max_iter= 150, solver= 'adam',random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation','Iterations'], index = ['Raw'])
show_results(df, model_grid)
# parameter_space1 = {
# 'hidden_layer_sizes': [(9),(5,5),(3, 3, 3)],
# 'activation': ['logistic','relu'],
# 'solver': ['lbfgs', 'adam'],
# 'learning_rate_init': [0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009,0.01],
# 'batch_size': [(20),(50)],
# 'max_iter': [(150),(200)],
# }
# clf1 = GridSearchCV(model, parameter_space1,n_jobs=-1)
# clf1.fit(X_train, y_train)
# clf1.best_params_
modelNN_best=MLPClassifier(activation= 'relu',batch_size= 20, hidden_layer_sizes= (5, 5),learning_rate_init= 0.004,max_iter= 150,solver= 'adam')
df= pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, modelNN_best)
# # Best parameter set
# print('------------------------------------------------------------------------------------------------------------------------')
# print('Best parameters found:\n', clf1.best_params_)
# print('------------------------------------------------------------------------------------------------------------------------')
# # All results
# means = clf1.cv_results_['mean_test_score']
# stds = clf1.cv_results_['std_test_score']
# for mean, std, params in zip(means, stds, clf1.cv_results_['params']):
# print("%0.3f (+/-%0.03f) for %r" % (mean, std , params))
# Model with best accuracy
labels_train = modelNN_best.predict(X_train)
labels_val = modelNN_best.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Check f1-score micro
f1_score(y_val, labels_val, average='micro')
The number K is typically chosen as the square root of the number of points in the training set. In this case, N is 15680, so K ≈ 125.
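The heuristic itself is one line; a sketch (15680 is the training split size stated above):

```python
import math

n_train = 15680                  # observations in the training split
k = round(math.sqrt(n_train))    # sqrt-N heuristic for the number of neighbors
print(k)  # 125
```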
# Try K = 50 through 149 and record the validation accuracy
k_range = range(50, 150)
scores = []
# We use a loop through the range
# We append the scores in the list
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_val)
scores.append(accuracy_score(y_val, y_pred))
# Plot the relationship between K and validation accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Validation Accuracy')
Now, testing many different combinations of parameters.
modelKNN1 = KNeighborsClassifier().fit(X = X_train, y = y_train)
print("train score:", modelKNN1.score(X_train, y_train))
print("validation score:",modelKNN1.score(X_val, y_val))
modelKNN2 = KNeighborsClassifier(n_neighbors=100).fit(X = X_train, y = y_train)
print("train score:", modelKNN2.score(X_train, y_train))
print("validation score:",modelKNN2.score(X_val, y_val))
# From the available algorithms (excluding the default), this was the best one
modelKNN3 = KNeighborsClassifier(n_neighbors=100, algorithm='ball_tree').fit(X = X_train, y = y_train)
print("train score:", modelKNN3.score(X_train, y_train))
print("validation score:",modelKNN3.score(X_val, y_val))
modelKNN4 = KNeighborsClassifier(n_neighbors=100, p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN4.score(X_train, y_train))
print("validation score:",modelKNN4.score(X_val, y_val))
modelKNN5 = KNeighborsClassifier(n_neighbors=100, weights='distance').fit(X = X_train, y = y_train)
print("train score:", modelKNN5.score(X_train, y_train))
print("validation score:",modelKNN5.score(X_val, y_val))
modelKNN6 = KNeighborsClassifier(n_neighbors=100, algorithm='ball_tree', p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN6.score(X_train, y_train))
print("validation score:",modelKNN6.score(X_val, y_val))
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['modelKNN1', 'modelKNN2', 'modelKNN3', 'modelKNN4', 'modelKNN5', 'modelKNN6'])
show_results_1(df, modelKNN1, modelKNN2, modelKNN3, modelKNN4, modelKNN5, modelKNN6)
# Model with best accuracy
labels_train = modelKNN6.predict(X_train)
labels_val = modelKNN6.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Creating and fitting model
pac_basic = PassiveAggressiveClassifier(random_state=42)
pac_basic.fit(X_train, y_train)
pac_1 = PassiveAggressiveClassifier(C=0.001, fit_intercept=True, tol=1e-2, loss='squared_hinge',random_state=42)
pac_1.fit(X_train, y_train)
pac_2 = PassiveAggressiveClassifier(C=0.001, tol=1e-2, loss='squared_hinge',random_state=42)
pac_2.fit(X_train, y_train)
pac_3 = PassiveAggressiveClassifier(C=0.001, tol=1e-2, random_state=42)
pac_3.fit(X_train, y_train)
# Making prediction on the validation set
val_pred_basic = pac_basic.predict(X_val)
val_pred_1 = pac_1.predict(X_val)
val_pred_2 = pac_2.predict(X_val)
val_pred_3 = pac_3.predict(X_val)
df = pd.DataFrame(columns = ['Time','Train','Validation','Iterations'], index = ['PAC_Basic','PAC_1','PAC_2','PAC_3'])
show_results(df, pac_basic, pac_1, pac_2, pac_3)
labels_train = pac_1.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = pac_1.predict(X_val)
accuracy_score(y_val, labels_val)
metrics(y_train, labels_train, y_val, labels_val)
modelLDA = LinearDiscriminantAnalysis()
modelLDA.fit(X = X_train, y = y_train)
labels_train = modelLDA.predict(X_train)
labels_val = modelLDA.predict(X_val)
modelLDA.predict_proba(X_val)
print("train score:", modelLDA.score(X_train, y_train))
print("validation score:",modelLDA.score(X_val, y_val))
# grid = dict()
# grid['shrinkage'] = [None, arange(0, 1, 0.01)]
# grid['solver']=['svd', 'lsqr', 'eigen'] # svd cannot be tested with shrinkage
# # Define search
# search = GridSearchCV(modelLDA, grid, scoring='accuracy', n_jobs=-1)
# # Perform the search
# results = search.fit(X_train, y_train)
# # Summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelLDA_final = LinearDiscriminantAnalysis(solver='lsqr')
modelLDA_final.fit(X = X_train, y = y_train)
labels_train = modelLDA_final.predict(X_train)
labels_val = modelLDA_final.predict(X_val)
print("train score:", modelLDA_final.score(X_train, y_train))
print("validation score:",modelLDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
modelQDA = QuadraticDiscriminantAnalysis()
modelQDA.fit(X = X_train, y = y_train)
labels_train = modelQDA.predict(X_train)
labels_val = modelQDA.predict(X_val)
modelQDA.predict_proba(X_val)
print("train score:", modelQDA.score(X_train, y_train))
print("validation score:",modelQDA.score(X_val, y_val))
# # Define grid
# grid = dict()
# grid['reg_param'] = arange(0, 1, 0.01)
# # Define search
# search = GridSearchCV(modelQDA, grid, scoring='accuracy', n_jobs=-1)
# # Perform the search
# results = search.fit(X_train, y_train)
# # Summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelQDA_final = QuadraticDiscriminantAnalysis(reg_param=0.14)
modelQDA_final.fit(X = X_train, y = y_train)
labels_train = modelQDA_final.predict(X_train)
labels_val = modelQDA_final.predict(X_val)
print("train score:", modelQDA_final.score(X_train, y_train))
print("validation score:",modelQDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
Testing several parameter combinations.
modelSVM_basic = SVC().fit(X_train, y_train)
modelSVM_1 = SVC(kernel='linear').fit(X_train, y_train)
modelSVM_2 = SVC(C=750).fit(X_train, y_train)
modelSVM_3 = SVC(kernel = 'poly').fit(X_train, y_train)
modelSVM_4 = SVC(C=750, kernel = 'poly').fit(X_train, y_train)
modelSVM_5 = SVC(C=750, kernel = 'linear').fit(X_train, y_train)
modelSVM_6 = SVC(C=750, shrinking=False).fit(X_train, y_train)
modelSVM_7 = SVC(C=750, tol=1e-2).fit(X_train, y_train)
# Plot the model's accuracies
accuracies = [modelSVM_basic.score(X_val, y_val), modelSVM_1.score(X_val, y_val),
modelSVM_2.score(X_val, y_val), modelSVM_3.score(X_val, y_val),
modelSVM_4.score(X_val, y_val), modelSVM_5.score(X_val, y_val),
modelSVM_6.score(X_val, y_val), modelSVM_7.score(X_val, y_val)]
models = ['modelSVM_basic', 'modelSVM_1', 'modelSVM_2', 'modelSVM_3',
'modelSVM_4', 'modelSVM_5', 'modelSVM_6', 'modelSVM_7']
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
plt.bar(data[0], data[1], color='peru')
plt.xticks(rotation=90)
plt.ylim(0.80,0.86)
plt.show()
# Highest accuracy from the SVMs
modelSVM_6.score(X_val, y_val)
# Check metrics on the best one
pred_train_svm = modelSVM_6.predict(X_train)
pred_val_svm = modelSVM_6.predict(X_val)
metrics(y_train, pred_train_svm, y_val, pred_val_svm)
# Function to analyze the best parameter definitions
def calculate_f1(interval, x_train, x_val, y_train, y_val, parameter):
train_results = []
val_results = []
for value in interval:
if parameter == 'Number of estimators':
dt = AdaBoostClassifier(n_estimators = value, random_state = 5)
elif parameter == 'Learning Rate':
dt = AdaBoostClassifier(learning_rate = value, random_state = 5)
dt.fit(x_train, y_train)
train_results.append(f1_score(y_train,dt.predict(x_train)))
val_results.append(f1_score(y_val,dt.predict(x_val)))
value_train = train_results.index(max(train_results))
value_val = val_results.index(max(val_results))
print('The best train value is ',interval[value_train])
print('The best val value is ',interval[value_val])
fig = plt.figure(figsize = (16,10))
line1, = plt.plot(interval, train_results, '#515C60', label="Train F1", linewidth=3,color='peru')
line2, = plt.plot(interval, val_results, '#C7DC1F', label="Val F1", linewidth=3,color='b')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("F1 score")
plt.xlabel(str(parameter))
plt.show()
num_estimators = list(range(1,100))
calculate_f1(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
learning_rate = list(np.arange(0.01, 2, 0.05))
calculate_f1(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
# AdaBoost = AdaBoostClassifier()
# AdaBoost_parameters = {'base_estimator' : [None, modelNB, modelQDA_final, pac_1, modelLDA_final],
# 'n_estimators' : list(range(1,100)),
# 'learning_rate' : np.arange(0.5, 1.5, 0.05),
# 'algorithm' : ['SAMME', 'SAMME.R']}
# AdaBoost_grid = GridSearchCV(estimator=AdaBoost, param_grid=AdaBoost_parameters,
# scoring='accuracy', verbose=1, n_jobs=-1)
# AdaBoost_grid.fit(X_train , y_train)
# AdaBoost_grid.best_params_
# Best AdaBoost based on the grid search
modelAdaBoost = AdaBoostClassifier(base_estimator=None, n_estimators=98, learning_rate=1.2, algorithm='SAMME.R', random_state=5)
modelAdaBoost.fit(X_train,y_train)
labels_train = modelAdaBoost.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelAdaBoost.predict(X_val)
accuracy_score(y_val, labels_val)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelAdaBoost)
# Check the metrics on the best AdaBoost
metrics(y_train, labels_train, y_val, labels_val)
# Function to analyze the best parameter definitions
def calculate_f1_2(interval, x_train, x_val, y_train, y_val, parameter):
train_results = []
val_results = []
for value in interval:
if parameter == 'Number of estimators':
dt = GradientBoostingClassifier(n_estimators = value, random_state = 5)
elif parameter == 'Learning Rate':
dt = GradientBoostingClassifier(learning_rate = value, random_state = 5)
dt.fit(x_train, y_train)
train_results.append(f1_score(y_train,dt.predict(x_train)))
val_results.append(f1_score(y_val,dt.predict(x_val)))
value_train = train_results.index(max(train_results))
value_val = val_results.index(max(val_results))
print('The best train value is ',interval[value_train])
print('The best val value is ',interval[value_val])
fig = plt.figure(figsize = (16,10))
line1, = plt.plot(interval, train_results, '#515C60', label="Train F1", linewidth=3,color='peru')
line2, = plt.plot(interval, val_results, '#C7DC1F', label="Val F1", linewidth=3,color='b')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("F1 score")
plt.xlabel(str(parameter))
plt.show()
learning_rate = list(np.arange(0.05, 1.5, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
learning_rate = list(np.arange(0.05, 1, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
learning_rate = list(np.arange(0.8, 1.8, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
num_estimators = list(np.arange(1, 200, 10))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
modelGBauto = GradientBoostingClassifier(max_features='auto', random_state=5)
modelGBlog = GradientBoostingClassifier(max_features='log2',random_state=5)
modelGBsqrt = GradientBoostingClassifier(max_features='sqrt',random_state=5)
modelGBnone = GradientBoostingClassifier(max_features=None,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Auto','Log2','Sqrt','None/Raw'])
show_results_1(df, modelGBauto, modelGBlog, modelGBsqrt, modelGBnone)
modelGBdev = GradientBoostingClassifier(loss='deviance', random_state=5)
modelGBexp = GradientBoostingClassifier(loss='exponential',random_state=5)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['deviance','exponential'])
show_results_1(df, modelGBdev, modelGBexp)
modelGB2 = GradientBoostingClassifier(max_depth=2, random_state=5)
modelGB3 = GradientBoostingClassifier(max_depth=3,random_state=5)
modelGB10 = GradientBoostingClassifier(max_depth=10,random_state=5)
modelGB30 = GradientBoostingClassifier(max_depth=30,random_state=5)
modelGB50 = GradientBoostingClassifier(max_depth=50,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['model2','model3','model10','model30','model50'])
show_results_1(df, modelGB2, modelGB3,modelGB10,modelGB30,modelGB50)
# GB_clf = GradientBoostingClassifier()
# GB_parameters = {'loss' : [ 'exponential'],
# 'learning_rate' : np.arange(1.0, 1.6, 0.05),
# 'n_estimators' : np.arange(150, 200, 5),
# 'max_depth' : np.arange(2, 10, 1),
# 'max_features' : ['log2', None]
# }
# GB_grid = GridSearchCV(estimator=GB_clf, param_grid=GB_parameters, scoring='accuracy', verbose=1, n_jobs=-1)
# GB_grid.fit(X_train , y_train)
# GB_grid.best_params_
# Best GB
modelGB = GradientBoostingClassifier(learning_rate=1.0, loss='exponential', max_depth=2, max_features='log2',
n_estimators=170, random_state=5)
modelGB.fit(X_train, y_train)
labels_train = modelGB.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelGB.predict(X_val)
accuracy_score(y_val, labels_val)
# Check f1-score micro
f1_score(y_val, labels_val, average='micro')
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelGB)
# Check metrics on the best GB
metrics(y_train, labels_train, y_val, labels_val)
Code to make the predictions on the Test dataset (modelGB was the best model):
# min_max_scaler = preprocessing.MinMaxScaler()
# metric_scaled = min_max_scaler.fit_transform(test.values)
# test = pd.DataFrame(metric_scaled, columns=test.columns, index=test.index)
# Citizen=df_test['CITIZEN_ID']
# labels_test= modelGB.predict(test)
# prediction=pd.concat([Citizen, pd.DataFrame(labels_test)],axis=1)
# prediction['Income']=prediction[0]
# prediction.drop(columns=0,inplace=True)
# prediction.to_csv(r'PATH\pred.csv',index=False, header=True,sep=',')
# Group of the best models on the Notebook
estimator = []
estimator.append(('GradientBoosting', GradientBoostingClassifier(learning_rate=1.0, loss='exponential', max_depth=2,
max_features='log2', n_estimators=170, random_state=5)))
estimator.append(('AdaBoost', AdaBoostClassifier(base_estimator=None, n_estimators=98, learning_rate=1.2,
algorithm='SAMME.R', random_state=5)))
estimator.append(('Decision Tree', DecisionTreeClassifier(ccp_alpha=0.000159, random_state=42)))
estimator.append(('Random Forest', RandomForestClassifier(ccp_alpha=0.000159, random_state=42)))
estimator.append(('SVM', SVC(C=750, shrinking=False, probability=True))) # Probability is True because it's needed for
# the soft voting
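Hard voting takes the majority of the predicted labels, while soft voting averages the predicted probabilities, which is why the SVC needs probability=True. A toy example of how the two can disagree, with hypothetical probabilities from three classifiers for one sample:

```python
import numpy as np

# Positive-class probabilities from three hypothetical classifiers
probs = np.array([0.45, 0.45, 0.9])

hard_votes = (probs >= 0.5).astype(int)            # labels [0, 0, 1]
hard_pred = int(np.bincount(hard_votes).argmax())  # majority vote -> class 0
soft_pred = int(probs.mean() >= 0.5)               # mean prob 0.6 -> class 1
print(hard_pred, soft_pred)  # 0 1
```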
# Voting Classifier with hard voting (default)
voting_hard = VotingClassifier(estimators=estimator, n_jobs=-1)
voting_hard.fit(X_train, y_train)
y_pred_hard = voting_hard.predict(X_val)
# Voting Classifier with soft voting
voting_soft = VotingClassifier(estimators=estimator, n_jobs=-1, voting='soft')
voting_soft.fit(X_train, y_train)
y_pred_soft = voting_soft.predict(X_val)
# Accuracy for hard voting
print("train score:", voting_hard.score(X_train, y_train))
print("validation score:", voting_hard.score(X_val, y_val))
# Accuracy for soft voting
print("train score:", voting_soft.score(X_train, y_train))
print("validation score:", voting_soft.score(X_val, y_val))
# Metrics for hard voting
labels_train = voting_hard.predict(X_train)
metrics(y_train, labels_train, y_val, y_pred_hard)
# Metrics for soft voting
labels_train = voting_soft.predict(X_train)
metrics(y_train, labels_train, y_val, y_pred_soft)
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
from collections import Counter
print('Original dataset shape %s' % Counter(y_train))
smotenc = SMOTENC(random_state=42, categorical_features=list(range(0,7)), k_neighbors=100, n_jobs=-1)
tomek = TomekLinks(n_jobs=-1)
smote_tomek = SMOTETomek(sampling_strategy='all', smote=smotenc, tomek=tomek, n_jobs=-1, random_state=42)
X_train, y_train = smote_tomek.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_train))
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix #confusion_matrix to evaluate the accuracy of a classification
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
import time
from sklearn.model_selection import KFold
from sklearn.tree import export_graphviz
import graphviz
import pydotplus
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from numpy import mean
from numpy import std
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerLine2D
from sklearn.svm import SVC
# Functions to be used in all models to assess them
def metrics(y_train, pred_train , y_val, pred_val):
print('_____________________________________')
print(' TRAIN ')
print('-----------------------------------------------------------------------------------------------------------')
print(classification_report(y_train, pred_train))
print(confusion_matrix(y_train, pred_train)) #true neg and true pos, false positives and false neg
print('_____________________________________')
print(' VALIDATION ')
print('-----------------------------------------------------------------------------------------------------------')
print(classification_report(y_val, pred_val))
print(confusion_matrix(y_val, pred_val))
def avg_score(model):
# apply kfold
kf = KFold(n_splits=10)
# create lists to store the results from the different models
score_train = []
score_val = []
timer = []
n_iter = []
for train_index, val_index in kf.split(all_selected_variables):
# get the indexes of the observations assigned for each partition
X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
y_train, y_val = target.iloc[train_index], target.iloc[val_index]
# start counting time
begin = time.perf_counter()
# fit the model to the data
model.fit(X_train, y_train)
# finish counting time
end = time.perf_counter()
# check the mean accuracy for the train
value_train = model.score(X_train, y_train)
# check the mean accuracy for the validation
value_val = model.score(X_val,y_val)
# append the accuracies, the time and the number of iterations in the corresponding list
score_train.append(value_train)
score_val.append(value_val)
timer.append(end-begin)
n_iter.append(model.n_iter_)
# calculate the average and the std for each measure (accuracy, time and number of iterations)
avg_time = round(np.mean(timer),3)
avg_train = round(np.mean(score_train),3)
avg_val = round(np.mean(score_val),3)
std_time = round(np.std(timer),2)
std_train = round(np.std(score_train),2)
std_val = round(np.std(score_val),2)
avg_iter = round(np.mean(n_iter),1)
std_iter = round(np.std(n_iter),1)
return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_val) + '+/-' + str(std_val), str(avg_iter) + '+/-' + str(std_iter)
def show_results(df, *args):
"""
Receive an empty dataframe and the different models and call the function avg_score
"""
count = 0
# for each model passed as argument
for arg in args:
# obtain the results provided by avg_score
time, avg_train, avg_val, avg_iter = avg_score(arg)
# store the results in the right row
df.iloc[count] = time, avg_train, avg_val, avg_iter
count+=1
return df
# For the models that don't have n_iter attribute
def avg_score_1(model):
# apply kfold
kf = KFold(n_splits=10)
# create lists to store the results from the different models
score_train = []
score_val = []
timer = []
n_iter = []
for train_index, val_index in kf.split(all_selected_variables):
# get the indexes of the observations assigned for each partition
X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
y_train, y_val = target.iloc[train_index], target.iloc[val_index]
# start counting time
begin = time.perf_counter()
# fit the model to the data
model.fit(X_train, y_train)
# finish counting time
end = time.perf_counter()
# check the mean accuracy for the train
value_train = model.score(X_train, y_train)
# check the mean accuracy for the validation
value_val = model.score(X_val,y_val)
# append the accuracies, the time and the number of iterations in the corresponding list
score_train.append(value_train)
score_val.append(value_val)
timer.append(end-begin)
#n_iter.append(model.n_iter_)
# calculate the average and the std for each measure (accuracy, time and number of iterations)
avg_time = round(np.mean(timer),3)
avg_train = round(np.mean(score_train),3)
avg_val = round(np.mean(score_val),3)
std_time = round(np.std(timer),2)
std_train = round(np.std(score_train),2)
std_val = round(np.std(score_val),2)
#avg_iter = round(np.mean(n_iter),1)
#std_iter = round(np.std(n_iter),1)
return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_val) + '+/-' + str(std_val)
#, str(avg_iter) + '+/-' + str(std_iter)
def show_results_1(df, *args):
"""
Receive an empty dataframe and the different models and call the function avg_score
"""
count = 0
# for each model passed as argument
for arg in args:
# obtain the results provided by avg_score
time, avg_train, avg_val = avg_score_1(arg)
# store the results in the right row
df.iloc[count] = time, avg_train, avg_val
count+=1
return df
def plot_tree(model_tree):
dot_data = export_graphviz(model_tree,
feature_names=X_train.columns,
class_names=["Income lower or equal to avg", "Income higher than avg"],
filled=True)
pydot_graph = pydotplus.graph_from_dot_data(dot_data)
pydot_graph.set_size('"20,20"')
return graphviz.Source(pydot_graph.to_string())
#AUC
def calculate_AUC(interval, x_train, x_val, y_train, y_val, parameter, max_depth = None):
train_results = []
val_results = []
for value in interval:
if (parameter == 'max_depth'):
dt = DecisionTreeClassifier(max_depth = value, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'max_features'):
dt = DecisionTreeClassifier(max_features = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_samples_split'):
dt = DecisionTreeClassifier(min_samples_split = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_samples_leaf'):
dt = DecisionTreeClassifier(min_samples_leaf = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_weight_fraction_leaf'):
dt = DecisionTreeClassifier(min_weight_fraction_leaf = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_impurity_decrease'):
dt = DecisionTreeClassifier(min_impurity_decrease = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
train_pred = dt.predict(x_train)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Add auc score to previous train results
train_results.append(roc_auc)
y_pred = dt.predict(x_val)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Add auc score to previous validation results
val_results.append(roc_auc)
value_train = train_results.index(max(train_results))
value_val = val_results.index(max(val_results))
print('The best train value is ',interval[value_train])
print('The best validation value is ',interval[value_val])
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(interval, train_results, 'b', label="Train AUC")
line2, = plt.plot(interval, val_results, 'r', label="Validation AUC")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("AUC score")
plt.xlabel(str(parameter))
plt.show()
dt_entropy = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train)
dt_gini = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Gini','Entropy'])
show_results_1(df,dt_gini, dt_entropy)
dt_random = DecisionTreeClassifier(splitter = 'random').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['best','random'])
show_results_1(df,dt_gini, dt_random)
max_depths = np.linspace(1, 15, 15, endpoint=True)
calculate_AUC(max_depths, X_train, X_val, y_train, y_val, 'max_depth')
dt_depth10 = DecisionTreeClassifier(max_depth = 10).fit(X_train, y_train)
dt_depth5 = DecisionTreeClassifier(max_depth = 5).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['full','depth10','depth5'])
show_results_1(df,dt_gini, dt_depth10,dt_depth5)
# The higher it is, the more overfitting! With 6 we get the best result of the 3 (less overfitting and a higher validation score)
max_features = list(range(1,len(X_train.columns)))
calculate_AUC(max_features, X_train, X_val, y_train, y_val,'max_features', 10)
min_samples_split = list(range(10,1000))
calculate_AUC(min_samples_split, X_train, X_val, y_train, y_val,'min_samples_split', 10)
dt_min10 = DecisionTreeClassifier(min_samples_split = 10).fit(X_train, y_train)
dt_min234 = DecisionTreeClassifier(min_samples_split = 234).fit(X_train, y_train)
dt_min250 = DecisionTreeClassifier(min_samples_split = 250).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['dt_min10','dt_min234','dt_min250'])
show_results_1(df, dt_min10, dt_min234, dt_min250)
min_samples_leaf = list(range(10,1001))
calculate_AUC(min_samples_leaf, X_train, X_val, y_train, y_val,'min_samples_leaf', 10)
dt_min_leaf24 = DecisionTreeClassifier(min_samples_leaf = 24).fit(X_train, y_train)
dt_min_leaf400 = DecisionTreeClassifier(min_samples_leaf = 400).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min leaf 24','Min leaf 400'])
show_results_1(df, dt_gini, dt_min_leaf24, dt_min_leaf400)
min_weight_fraction_leaf = np.linspace(0, 0.5, 250, endpoint=True)
calculate_AUC(min_weight_fraction_leaf, X_train, X_val, y_train, y_val,'min_weight_fraction_leaf', 10)
dt_min_weight_1 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.001).fit(X_train, y_train)
dt_min_weight_2 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.01).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min weight small','Min weight med'])
show_results_1(df, dt_gini, dt_min_weight_1, dt_min_weight_2)
min_impurity_decrease = np.linspace(0, 0.05, 500, endpoint=True)
calculate_AUC(min_impurity_decrease, X_train, X_val, y_train, y_val,'min_impurity_decrease', 10)
dt_impurity01 = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_train, y_train)
dt_impurity0001 = DecisionTreeClassifier(min_impurity_decrease=0.0001).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Baseline','dt_impurity01','dt_impurity0001'])
show_results_1(df,dt_gini, dt_impurity01,dt_impurity0001)
#ccp_alpha
dt_alpha = DecisionTreeClassifier(random_state=42)
path = dt_alpha.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots(figsize = (10,10))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha", fontsize=15)
ax.set_ylabel("total impurity of leaves", fontsize=15)
ax.set_title("Total Impurity vs effective alpha for training set", fontsize=15)
# the function below did not accept ccp_alphas smaller than 0
ccp_alphas=ccp_alphas[ccp_alphas>0]
trees = []
for ccp_alpha in ccp_alphas:
dt_alpha = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha).fit(X_train, y_train)
trees.append(dt_alpha)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(trees[-1].tree_.node_count, ccp_alphas[-1]))
trees = trees[:-1]
ccp_alphas = ccp_alphas[:-1]
train_scores = [tree.score(X_train, y_train) for tree in trees]
val_scores = [tree.score(X_val, y_val) for tree in trees]
fig, ax = plt.subplots(figsize = (10,10))
ax.set_xlabel("alpha", fontsize=15)
ax.set_ylabel("accuracy", fontsize=15)
ax.set_title("Accuracy vs alpha for training and validation sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, val_scores, marker='o', label="validation", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(val_scores)
best_model = trees[index_best_model]
print('ccp_alpha of best model: ', ccp_alphas[index_best_model])
print('_____________________________________________________________')
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Validation accuracy of best model: ',best_model.score(X_val, y_val))
dt_t1=DecisionTreeClassifier(splitter = 'random', max_depth = 5, min_samples_split=7,
min_weight_fraction_leaf = 0.01, min_impurity_decrease=0.01,random_state=42).fit(X_train, y_train)
dt_t2=DecisionTreeClassifier(max_depth = 5,min_weight_fraction_leaf = 0.002,random_state=42).fit(X_train, y_train)
dt_t3=DecisionTreeClassifier(splitter = 'random', max_depth = 5, min_samples_split=7,
min_weight_fraction_leaf = 0.001, random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t1.score(X_train, y_train))
print('Validation accuracy:',dt_t1.score(X_val, y_val))
print('Train accuracy:',dt_t2.score(X_train, y_train))
print('Validation accuracy:',dt_t2.score(X_val, y_val))
print('Train accuracy:',dt_t3.score(X_train, y_train))
print('Validation accuracy:',dt_t3.score(X_val, y_val))
BEST ONE!!
dt_t4=DecisionTreeClassifier(splitter = 'random', max_depth = 5, min_samples_split = 400,
min_weight_fraction_leaf = 0.001, random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t4.score(X_train, y_train))
print('Validation accuracy:',dt_t4.score(X_val, y_val))
# Also creating the tree indicated as best by ccp_alpha:
dt_t5=DecisionTreeClassifier(ccp_alpha=9.647542354307014e-05, random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t5.score(X_train, y_train))
print('Validation accuracy:',dt_t5.score(X_val, y_val))
print("train score:", dt_2.score(X_train, y_train))
print("validation score:",dt_2.score(X_val, y_val))
#does changing the threshold improve the accuracy?
threshold = 0.4
predicted_proba = dt_2.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
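The single cut-off above (0.4) can be generalized: sweeping a grid of thresholds over the predicted probabilities shows whether any cut-off beats the default 0.5. A minimal, self-contained sketch (the helper name `best_threshold` and the toy arrays are ours, not taken from the models):

```python
def best_threshold(probs, y_true, thresholds):
    """Return (threshold, accuracy) for the cut-off that maximizes accuracy."""
    best = (None, -1.0)
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        acc = sum(p == y for p, y in zip(preds, y_true)) / len(y_true)
        if acc > best[1]:
            best = (t, acc)
    return best

# toy positive-class probabilities and true labels
probs = [0.1, 0.4, 0.6, 0.9]
y_true = [0, 0, 1, 1]
print(best_threshold(probs, y_true, [0.3, 0.4, 0.5, 0.6, 0.7]))  # -> (0.5, 1.0)
```

On the real model the same sweep would use `dt_2.predict_proba(X_val)[:, 1]` and `y_val`.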
# To build the ROC curve
prob_model1 = dt_t1.predict_proba(X_val)
prob_model2 = dt_t2.predict_proba(X_val)
prob_model3 = dt_t3.predict_proba(X_val)
prob_model4 = dt_t4.predict_proba(X_val)
prob_model5 = dt_t5.predict_proba(X_val)
fpr_1, tpr_1, thresholds_1 = roc_curve(y_val, prob_model1[:, 1])
fpr_2, tpr_2, thresholds_2 = roc_curve(y_val, prob_model2[:, 1])
fpr_3, tpr_3, thresholds_3 = roc_curve(y_val, prob_model3[:, 1])
fpr_4, tpr_4, thresholds_4 = roc_curve(y_val, prob_model4[:, 1])
fpr_5, tpr_5, thresholds_5 = roc_curve(y_val, prob_model5[:, 1])
plt.plot(fpr_1, tpr_1, label="ROC Curve model 1")
plt.plot(fpr_2, tpr_2, label="ROC Curve model 2")
plt.plot(fpr_3, tpr_3, label="ROC Curve model 3")
plt.plot(fpr_4, tpr_4, label="ROC Curve model 4")
plt.plot(fpr_5, tpr_5, label="ROC Curve model 5")
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
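To rank the five curves numerically rather than by eye, the area under each ROC curve can be approximated with the trapezoidal rule over the (FPR, TPR) points. A self-contained sketch (the helper `trapezoid_auc` is ours; `sklearn.metrics.auc` does the same job on the real `fpr_*`, `tpr_*` arrays):

```python
def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule (points sorted by FPR)."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# a toy 3-point ROC curve: (0,0) -> (0.2,0.8) -> (1,1)
print(trapezoid_auc([0.0, 0.2, 1.0], [0.0, 0.8, 1.0]))  # ~ 0.8
```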
# the best one appears to overfit!!!
labels_train = dt_t3.predict(X_train)
labels_val = dt_t3.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
RANDOM_STATE = 42  # avoid shadowing the `random` module
ensemble_clfs = [
    ("RandomForestClassifier, max_features='auto'",
     RandomForestClassifier(oob_score=True,
                            max_features='auto',
                            random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=6",
     RandomForestClassifier(max_features=6,
                            oob_score=True,
                            random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
     RandomForestClassifier(max_features=None,
                            oob_score=True,
                            random_state=RANDOM_STATE))
]
from collections import OrderedDict
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 175 #225
for label, clf in ensemble_clfs:
for i in range(min_estimators, max_estimators + 1):
clf.set_params(n_estimators=i)
clf.fit(X_train, y_train)
# Record the OOB error for each `n_estimators=i` setting.
oob_error = 1 - clf.oob_score_
error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
xs, ys = zip(*clf_err)
plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
# # Creating and fitting the models
# rf_1 = RandomForestClassifier(n_estimators=110, max_depth=10, random_state = 42)
# rf_1=rf_1.fit(X_train, y_train)
# rf_2 = RandomForestClassifier(n_estimators=110, max_depth=10, max_features = 6, random_state = 42)
# rf_2=rf_2.fit(X_train, y_train)
# rf_3 = RandomForestClassifier(n_estimators=110, max_depth=10, min_samples_split=17, random_state = 42)
# rf_3=rf_3.fit(X_train, y_train)
rf_1= RandomForestClassifier(min_samples_split = 350, min_weight_fraction_leaf = 0.002,random_state=42).fit(X_train, y_train)
rf_2= RandomForestClassifier(ccp_alpha=0.000159, random_state=42).fit(X_train, y_train)
rf_3= RandomForestClassifier(max_depth = 6, min_weight_fraction_leaf = 0.002, random_state=42).fit(X_train, y_train)
rf_4= RandomForestClassifier(n_estimators=110, max_depth=6, random_state = 42).fit(X_train, y_train)
rf_5 = RandomForestClassifier(n_estimators=110, max_depth=6, max_features = 6, random_state = 42).fit(X_train, y_train)
print("train score:", rf_1.score(X_train, y_train))
print("validation score:",rf_1.score(X_val, y_val))
print("train score:", rf_2.score(X_train, y_train))
print("validation score:",rf_2.score(X_val, y_val))
print("train score:", rf_3.score(X_train, y_train))
print("validation score:",rf_3.score(X_val, y_val))
print('Train accuracy:',rf_4.score(X_train, y_train))
print('Validation accuracy:',rf_4.score(X_val, y_val))
print('Train accuracy:',rf_5.score(X_train, y_train))
print('Validation accuracy:',rf_5.score(X_val, y_val))
rf_6= RandomForestClassifier(random_state=42).fit(X_train, y_train)
print('Train accuracy:',rf_6.score(X_train, y_train))
print('Validation accuracy:',rf_6.score(X_val, y_val))
labels_train = rf_4.predict(X_train)
labels_val = rf_4.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
#changing the threshold does not seem to improve the accuracy of the best random forest!
threshold = 0.4
predicted_proba = rf_2.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
#importing and defining the model
log_model = LogisticRegression(random_state=4)
log_model.fit(X_train,y_train) #fit model to our train data
labels_train = log_model.predict(X_train)
#log_model.score(X_train, y_train)
#Predict class labels for samples in X
labels_val = log_model.predict(X_val)
#log_model.score(X_val, y_val)
#predict values for X_test, e.g. for the citizen in X_test[0] we are predicting y[0] -> 0
print("train score:", log_model.score(X_train, y_train))
print("validation score:",log_model.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
#precision: ability of the classifier to not label a sample as positive if it is negative
#recall: ability of the classifier to find all the positive samples
#accuracy: out of the whole dataset, the proportion of samples we classify correctly
#f1: weighted harmonic mean of the precision and recall
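Those metrics follow directly from the confusion-matrix counts. A self-contained sketch with illustrative counts (the numbers are made up, not taken from the models):

```python
# illustrative confusion-matrix counts (not from the actual models)
tp, fp, fn, tn = 8, 2, 4, 6

precision = tp / (tp + fp)                    # predicted positives that are real
recall = tp / (tp + fn)                       # real positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# -> 0.8 0.667 0.727 0.7
```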
modelNB = GaussianNB()
modelNB.fit(X = X_train, y = y_train)
print("train score:", modelNB.score(X_train, y_train))
print("validation score:",modelNB.score(X_val, y_val))
modelNB2 = GaussianNB(var_smoothing=0.0001)
modelNB2.fit(X = X_train, y = y_train)
print("train score:", modelNB2.score(X_train, y_train))
print("validation score:",modelNB2.score(X_val, y_val))
modelNB3 = GaussianNB(var_smoothing=0.001)
modelNB3.fit(X = X_train, y = y_train)
print("train score:", modelNB3.score(X_train, y_train))
print("validation score:",modelNB3.score(X_val, y_val))
modelNB4 = GaussianNB(var_smoothing=0.01)
modelNB4.fit(X = X_train, y = y_train)
print("train score:", modelNB4.score(X_train, y_train))
print("validation score:",modelNB4.score(X_val, y_val))
labels_train = modelNB4.predict(X_train)
labels_val = modelNB4.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
model = MLPClassifier(random_state=4)
model.fit(X_train, y_train)
labels_train = model.predict(X_train)
labels_val = model.predict(X_val)
print("train score:", model.score(X_train, y_train))
print("validation score:",model.score(X_val, y_val))
f1_score(y_val, labels_val, average='micro')
metrics(y_train, labels_train, y_val, labels_val)
# test
# Confirm whether the test set is already normalized
# min_max_scaler = preprocessing.MinMaxScaler()
# metric_scaled = min_max_scaler.fit_transform(test.values)
# test= pd.DataFrame(metric_scaled, columns=test.columns, index=test.index)
Citizen=df_test['CITIZEN_ID']
labels_test= model.predict(test)
prediction=pd.concat([Citizen, pd.DataFrame(labels_test)],axis=1)
prediction['Income']=prediction[0]
prediction.drop(columns=0,inplace=True)
# write the predictions to a local "Predictions" folder (adjust the path as needed)
prediction.to_csv('Predictions/Pred4.csv', index=False, header=True, sep=',')
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model)
model_1 = MLPClassifier(hidden_layer_sizes=(1),random_state=4)
model_2 = MLPClassifier(hidden_layer_sizes=(3),random_state=4)
model_3 = MLPClassifier(hidden_layer_sizes=(9),random_state=4)
model_4 = MLPClassifier(hidden_layer_sizes=(3, 3),random_state=4)
model_5 = MLPClassifier(hidden_layer_sizes=(5, 5),random_state=4)
model_6 = MLPClassifier(hidden_layer_sizes=(3, 3, 3),random_state=4) #3 layers each one with 3 units
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_1','M_2','M_3', 'M_4','M_5','M_6'])
show_results(df, model_1, model_2, model_3, model_4, model_5, model_6)
model_7 = MLPClassifier(hidden_layer_sizes=(4, 4),random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_7'])
show_results(df, model_7)
To test: M5, M3, M6, M7, M4
model_logistic = MLPClassifier(activation = 'logistic',random_state=4)
model_tanh = MLPClassifier(activation = 'tanh',random_state=4)
model_relu=MLPClassifier(activation = 'relu',random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['logistic','tanh','relu'])
show_results(df, model_logistic, model_tanh,model_relu)
ReLU is better
model_lbfgs = MLPClassifier(solver = 'lbfgs',random_state=4) #low dim and sparse data
model_sgd = MLPClassifier(solver = 'sgd',random_state=4) #accuracy > processing time
model_adam = MLPClassifier(solver = 'adam',random_state=4) # big dataset but might fail to converge
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['lbfgs','sgd','adam'])
show_results(df, model_lbfgs, model_sgd, model_adam)
Test LBFGS & Adam
model_constant = MLPClassifier(solver = 'sgd', learning_rate = 'constant',random_state=4)
model_invscaling = MLPClassifier(solver = 'sgd', learning_rate = 'invscaling',random_state=4)
model_adaptive = MLPClassifier(solver = 'sgd', learning_rate = 'adaptive',random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['constant','invscaling','adaptive'])
show_results(df, model_constant, model_invscaling, model_adaptive)
Test constant & adaptive
model_a = MLPClassifier(solver = 'adam', learning_rate_init = 0.5,random_state=4) #the higher it is, the faster the model learns
model_b = MLPClassifier(solver = 'adam', learning_rate_init = 0.1,random_state=4)
model_c = MLPClassifier(solver = 'adam', learning_rate_init = 0.01,random_state=4) #if it is too small, it may get stuck in a suboptimal solution and never converge
model_d = MLPClassifier(solver = 'adam', learning_rate_init = 0.001,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_a','M_b','M_c', "M_d"])
show_results(df, model_a, model_b, model_c, model_d)
The best is 0.01 or 0.001, so test a value in between
model_e = MLPClassifier(solver = 'adam', learning_rate_init = 0.005,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ["M_e"])
show_results(df, model_e)
#USE ONLY 0.005
model_batch20 = MLPClassifier(solver = 'sgd', batch_size = 20,random_state=4)
model_batch50 = MLPClassifier(solver = 'sgd', batch_size = 50,random_state=4)
model_batch100 = MLPClassifier(solver = 'sgd', batch_size = 100,random_state=4)
model_batch200 = MLPClassifier(solver = 'sgd', batch_size = 200,random_state=4)
model_batch500 = MLPClassifier(solver = 'sgd', batch_size = 500,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['batch 20','batch 50','batch 100', 'batch 200', 'batch 500'])
show_results(df, model_batch20, model_batch50, model_batch100, model_batch200, model_batch500)
The best one is batch 20
model_maxiter_50 = MLPClassifier(max_iter = 50,random_state=4)
model_maxiter_100 = MLPClassifier(max_iter = 100,random_state=4)
model_maxiter_200 = MLPClassifier(max_iter = 200,random_state=4)
model_maxiter_300 = MLPClassifier(max_iter = 300,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 50','max iter 100','max iter 200', 'max iter 300'])
show_results(df, model_maxiter_50, model_maxiter_100, model_maxiter_200, model_maxiter_300)
Options between 150 and 300
model_maxiter_150 = MLPClassifier(max_iter = 150,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 150'])
show_results(df, model_maxiter_150)
model_all=MLPClassifier(hidden_layer_sizes=(9),activation = 'logistic',solver = 'adam',learning_rate_init = 0.1,batch_size = 50,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model_all)
# parameter_space = {
# 'hidden_layer_sizes': [(5,5),(3,3,3)],
# 'activation': ['tanh','relu'],
# 'solver': ['adam'],
# 'learning_rate_init': [(0.005)],
# 'batch_size': [(20)],
# 'max_iter': [(150),(200),(300)],
# }
# clf = GridSearchCV(model, parameter_space,n_jobs=-1)
# clf.fit(X_train, y_train)
# clf.best_params_
NNgrid=MLPClassifier(random_state=4,hidden_layer_sizes=(5,5),activation='tanh',solver='adam',learning_rate_init=0.005,batch_size=20,max_iter=150).fit(X_train,y_train)
print("train score:", NNgrid.score(X_train, y_train))
print("validation score:",NNgrid.score(X_val, y_val))
labels_train = NNgrid.predict(X_train)
labels_val = NNgrid.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
print('Training data length:',len(X_train))
print('Validation data length:',len(X_val))
The number K is typically chosen as the square root of the total number of points in the training data set. In this case, N is 23559 (it was 15680 before), so K ≈ 153.
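That rule of thumb is a one-liner; a sketch (the helper name `heuristic_k` is ours), with an optional bump to the nearest odd value to avoid tied votes in binary classification:

```python
import math

def heuristic_k(n_samples):
    """K = sqrt(N), rounded, bumped to the nearest odd integer to avoid ties."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1

print(heuristic_k(23559))  # -> 153
print(heuristic_k(15680))  # -> 125
```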
# try K=5 through K=169 and record validation accuracy
k_range = range(5, 170)
scores = []
# We use a loop through the range
# We append the scores in the list
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_val)
scores.append(accuracy_score(y_val, y_pred))
# plot the relationship between K and validation accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Validation Accuracy')
# the default n_neighbors is 5
modelKNN1 = KNeighborsClassifier().fit(X = X_train, y = y_train)
print("train score:", modelKNN1.score(X_train, y_train))
print("validation score:",modelKNN1.score(X_val, y_val))
modelKNN2 = KNeighborsClassifier(n_neighbors=100).fit(X = X_train, y = y_train)
print("train score:", modelKNN2.score(X_train, y_train))
print("validation score:",modelKNN2.score(X_val, y_val))
modelKNN3 = KNeighborsClassifier(n_neighbors=12).fit(X = X_train, y = y_train)
print("train score:", modelKNN3.score(X_train, y_train))
print("validation score:",modelKNN3.score(X_val, y_val))
modelKNN4 = KNeighborsClassifier(n_neighbors=10).fit(X = X_train, y = y_train)
print("train score:", modelKNN4.score(X_train, y_train))
print("validation score:",modelKNN4.score(X_val, y_val))
#from the available algorithms (excluding the default), this was the best one
# it had n_neighbors=100, I left the default
modelKNN5 = KNeighborsClassifier(n_neighbors=10,algorithm='ball_tree').fit(X = X_train, y = y_train)
print("train score:", modelKNN5.score(X_train, y_train))
print("validation score:",modelKNN5.score(X_val, y_val))
modelKNN6 = KNeighborsClassifier(n_neighbors=10,p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN6.score(X_train, y_train))
print("validation score:",modelKNN6.score(X_val, y_val))
BEST SO FAR
modelKNN7 = KNeighborsClassifier(n_neighbors=10,p=1,weights='distance').fit(X = X_train, y = y_train)
print("train score:", modelKNN7.score(X_train, y_train))
print("validation score:",modelKNN7.score(X_val, y_val))
modelKNN8 = KNeighborsClassifier(n_neighbors=10,algorithm='ball_tree', p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN8.score(X_train, y_train))
print("validation score:",modelKNN8.score(X_val, y_val))
# Model with best accuracy
labels_train = modelKNN6.predict(X_train)
labels_val = modelKNN6.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Creating and fitting model
pac_basic = PassiveAggressiveClassifier(random_state=42)
pac_basic.fit(X_train, y_train)
pac_1 = PassiveAggressiveClassifier(C=0.001, fit_intercept=True, tol=1e-2, loss='squared_hinge',random_state=42)
pac_1.fit(X_train, y_train)
pac_2 = PassiveAggressiveClassifier(C=0.001, tol=1e-2, loss='squared_hinge',random_state=42)
pac_2.fit(X_train, y_train)
pac_3 = PassiveAggressiveClassifier(C=0.001, tol=1e-2, random_state=42)
pac_3.fit(X_train, y_train)
print("train score:", pac_basic.score(X_train, y_train))
print("validation score:",pac_basic.score(X_val, y_val))
print("train score:", pac_1.score(X_train, y_train))
print("validation score:",pac_1.score(X_val, y_val))
print("train score:", pac_2.score(X_train, y_train))
print("validation score:",pac_2.score(X_val, y_val))
print("train score:", pac_3.score(X_train, y_train))
print("validation score:",pac_3.score(X_val, y_val))
pac_4 = PassiveAggressiveClassifier(C=0.01, loss='squared_hinge',fit_intercept=True,random_state=42).fit(X_train, y_train)
print("train score:", pac_4.score(X_train, y_train))
print("validation score:",pac_4.score(X_val, y_val))
pac_5 = PassiveAggressiveClassifier(C=0.005, loss='squared_hinge',fit_intercept=True,random_state=42).fit(X_train, y_train)
print("train score:", pac_5.score(X_train, y_train))
print("validation score:",pac_5.score(X_val, y_val))
labels_train = pac_basic.predict(X_train)
labels_val = pac_basic.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
modelLDA = LinearDiscriminantAnalysis()
modelLDA.fit(X = X_train, y = y_train)
labels_train = modelLDA.predict(X_train)
labels_val = modelLDA.predict(X_val)
modelLDA.predict_proba(X_val)
print("train score:", modelLDA.score(X_train, y_train))
print("validation score:",modelLDA.score(X_val, y_val))
# grid = dict()
# grid['shrinkage'] = [None, np.arange(0, 1, 0.01)]
# grid['solver']=['svd', 'lsqr', 'eigen'] #svd cannot be tested with shrinkage
# # define search
# search = GridSearchCV(modelLDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelLDA_final = LinearDiscriminantAnalysis(solver='lsqr')
modelLDA_final.fit(X = X_train, y = y_train)
labels_train = modelLDA_final.predict(X_train)
labels_val = modelLDA_final.predict(X_val)
print("train score:", modelLDA_final.score(X_train, y_train))
print("validation score:",modelLDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
modelQDA = QuadraticDiscriminantAnalysis()
modelQDA.fit(X = X_train, y = y_train)
labels_train = modelQDA.predict(X_train)
labels_val = modelQDA.predict(X_val)
#modelQDA.predict_proba(X_val)
print("train score:", modelQDA.score(X_train, y_train))
print("validation score:",modelQDA.score(X_val, y_val))
# # define grid
# grid = dict()
# grid['reg_param'] = arange(0, 1, 0.01)
# # define search
# search = GridSearchCV(modelQDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelQDA_final = QuadraticDiscriminantAnalysis(reg_param=0.02)
modelQDA_final.fit(X = X_train, y = y_train)
labels_train = modelQDA_final.predict(X_train)
labels_val = modelQDA_final.predict(X_val)
print("train score:", modelQDA_final.score(X_train, y_train))
print("validation score:",modelQDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
modelSVM_basic = SVC().fit(X_train, y_train)
modelSVM_1 = SVC(kernel='linear').fit(X_train, y_train)
modelSVM_2 = SVC(C=750).fit(X_train, y_train)
modelSVM_3 = SVC(kernel = 'poly').fit(X_train, y_train)
modelSVM_4 = SVC(C=750, kernel = 'poly').fit(X_train, y_train)
modelSVM_5 = SVC(C=750, kernel = 'linear').fit(X_train, y_train)
modelSVM_6 = SVC(C=750, shrinking=False).fit(X_train, y_train)
modelSVM_7 = SVC(C=750, tol=1e-2).fit(X_train, y_train)
accuracies = [modelSVM_basic.score(X_val, y_val), modelSVM_1.score(X_val, y_val),
modelSVM_2.score(X_val, y_val), modelSVM_3.score(X_val, y_val),
modelSVM_4.score(X_val, y_val), modelSVM_5.score(X_val, y_val),
modelSVM_6.score(X_val, y_val), modelSVM_7.score(X_val, y_val)]
models = ['modelSVM_basic', 'modelSVM_1', 'modelSVM_2', 'modelSVM_3',
'modelSVM_4', 'modelSVM_5', 'modelSVM_6', 'modelSVM_7']
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
data
print("train score:", modelSVM_4.score(X_train, y_train))
print("validation score:",modelSVM_4.score(X_val, y_val))
print("train score:", modelSVM_2.score(X_train, y_train))
print("validation score:",modelSVM_2.score(X_val, y_val))
print("train score:", modelSVM_7.score(X_train, y_train))
print("validation score:",modelSVM_7.score(X_val, y_val))
print("train score:", modelSVM_1.score(X_train, y_train))
print("validation score:",modelSVM_1.score(X_val, y_val))
modelSVM_8 = SVC(C=10, kernel = 'poly').fit(X_train, y_train)
print("train score:", modelSVM_8.score(X_train, y_train))
print("validation score:",modelSVM_8.score(X_val, y_val))
modelSVM_9 = SVC(C=1, kernel = 'poly').fit(X_train, y_train)
print("train score:", modelSVM_9.score(X_train, y_train))
print("validation score:",modelSVM_9.score(X_val, y_val))
labels_train = modelSVM_8.predict(X_train)
labels_val = modelSVM_8.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
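The manual sweep over `C` and `kernel` above can also be expressed as a grid search, as done in the commented-out grids for the boosting models. A minimal, self-contained sketch on synthetic data — the grid values here are illustrative, not the ones used for the final model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train, only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Illustrative grid mirroring the manual sweep over C and kernel
param_grid = {'C': [1, 10, 750], 'kernel': ['rbf', 'poly', 'linear']}
grid = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid.fit(X_tr, y_tr)

best_params = grid.best_params_
val_acc = grid.score(X_te, y_te)
```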
def calculate_f1(interval, x_train, x_val, y_train, y_val, parameter):
    train_results = []
    val_results = []
    for value in interval:
        if parameter == 'Number of estimators':
            dt = AdaBoostClassifier(n_estimators=value, random_state=5)
        elif parameter == 'Learning Rate':
            dt = AdaBoostClassifier(learning_rate=value, random_state=5)
        dt.fit(x_train, y_train)
        train_results.append(f1_score(y_train, dt.predict(x_train)))
        val_results.append(f1_score(y_val, dt.predict(x_val)))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is ', interval[value_train])
    print('The best val value is ', interval[value_val])
    fig = plt.figure(figsize=(16, 10))
    line1, = plt.plot(interval, train_results, label="Train F1", linewidth=3, color='peru')
    line2, = plt.plot(interval, val_results, label="Val F1", linewidth=3, color='b')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("F1 score")
    plt.xlabel(str(parameter))
    plt.show()
num_estimators = list(range(70,130))
calculate_f1(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
learning_rate = list(np.arange(0.5, 2.5, 0.05))
calculate_f1(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
# AdaBoost = AdaBoostClassifier()
# AdaBoost_parameters = {'base_estimator' : [None, modelNB, modelQDA_final, pac_1, modelLDA_final],
# 'n_estimators' : list(range(70,130)),
# 'learning_rate' : np.arange(1.0, 2.0, 0.05),
# 'algorithm' : ['SAMME', 'SAMME.R']}
# AdaBoost_grid = GridSearchCV(estimator=AdaBoost, param_grid=AdaBoost_parameters,
# scoring='accuracy', verbose=1, n_jobs=-1)
# AdaBoost_grid.fit(X_train , y_train)
# AdaBoost_grid.best_params_
modelAdaBoost = AdaBoostClassifier(base_estimator=None, n_estimators=98, learning_rate=1.2, algorithm='SAMME.R', random_state=5)
modelAdaBoost.fit(X_train,y_train)
labels_train = modelAdaBoost.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelAdaBoost.predict(X_val)
accuracy_score(y_val, labels_val)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelAdaBoost)
metrics(y_train, labels_train, y_val, labels_val)
def calculate_f1_2(interval, x_train, x_val, y_train, y_val, parameter):
    train_results = []
    val_results = []
    for value in interval:
        if parameter == 'Number of estimators':
            dt = GradientBoostingClassifier(n_estimators=value, random_state=5)
        elif parameter == 'Learning Rate':
            dt = GradientBoostingClassifier(learning_rate=value, random_state=5)
        dt.fit(x_train, y_train)
        train_results.append(f1_score(y_train, dt.predict(x_train)))
        val_results.append(f1_score(y_val, dt.predict(x_val)))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is ', interval[value_train])
    print('The best val value is ', interval[value_val])
    fig = plt.figure(figsize=(16, 10))
    line1, = plt.plot(interval, train_results, label="Train F1", linewidth=3, color='peru')
    line2, = plt.plot(interval, val_results, label="Val F1", linewidth=3, color='b')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("F1 score")
    plt.xlabel(str(parameter))
    plt.show()
learning_rate = list(np.arange(0.01, 0.5, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
learning_rate = list(np.arange(0.05, 1, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
num_estimators = list(np.arange(1, 200, 10))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
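As a side note, refitting a full GradientBoostingClassifier for every candidate `n_estimators`, as the sweep above does, is expensive; `staged_predict` scores every boosting stage of a single fitted model instead. A sketch on synthetic data (not the project data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=5)
X_tr, X_v, y_tr, y_v = train_test_split(X, y, test_size=0.3, random_state=5)

# Fit once with the largest n_estimators, then score every intermediate stage
gb = GradientBoostingClassifier(n_estimators=200, random_state=5).fit(X_tr, y_tr)
val_f1 = [f1_score(y_v, pred) for pred in gb.staged_predict(X_v)]
best_n = 1 + max(range(len(val_f1)), key=val_f1.__getitem__)
```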
modelGBauto = GradientBoostingClassifier(max_features='auto', random_state=5)
modelGBlog = GradientBoostingClassifier(max_features='log2',random_state=5)
modelGBsqrt = GradientBoostingClassifier(max_features='sqrt',random_state=5)
modelGBnone = GradientBoostingClassifier(max_features=None,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Auto','Log2','Sqrt','None/Raw'])
show_results_1(df, modelGBauto, modelGBlog, modelGBsqrt, modelGBnone)
modelGBdev = GradientBoostingClassifier(loss='deviance', random_state=5)
modelGBexp = GradientBoostingClassifier(loss='exponential',random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['deviance','exponential'])
show_results_1(df, modelGBdev, modelGBexp)
modelGB2 = GradientBoostingClassifier(max_depth=2, random_state=5)
modelGB3 = GradientBoostingClassifier(max_depth=3,random_state=5)
modelGB10 = GradientBoostingClassifier(max_depth=10,random_state=5)
modelGB30 = GradientBoostingClassifier(max_depth=30,random_state=5)
modelGB50 = GradientBoostingClassifier(max_depth=50,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['model2','model3','model10','model30','model50'])
show_results_1(df, modelGB2, modelGB3,modelGB10,modelGB30,modelGB50)
# GB_clf = GradientBoostingClassifier()
# GB_parameters = {'loss' : [ 'deviance'],
# 'learning_rate' : np.arange(0.05, 0.8, 0.05),
# 'n_estimators' : np.arange(130, 200, 5),
# 'max_depth' : np.arange(1, 3, 1),
# 'max_features' : ['auto', None]
# }
# GB_grid = GridSearchCV(estimator=GB_clf, param_grid=GB_parameters, scoring='accuracy', verbose=1, n_jobs=-1)
# GB_grid.fit(X_train , y_train)
# GB_grid.best_params_
modelGB = GradientBoostingClassifier(learning_rate=1.0, loss='exponential', max_depth=2, max_features='log2',
n_estimators=170, random_state=5)
modelGB.fit(X_train, y_train)
labels_train = modelGB.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelGB.predict(X_val)
accuracy_score(y_val, labels_val)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelGB)
metrics(y_train, labels_train, y_val, labels_val)
# Test
# modelGB = GradientBoostingClassifier(learning_rate=1.0, max_depth=2, max_features='log2',
# n_estimators=170, random_state=5)
# modelGB.fit(X_train, y_train)
# labels_train = modelGB.predict(X_train)
# print(accuracy_score(y_train, labels_train))
# labels_val = modelGB.predict(X_val)
# print(accuracy_score(y_val, labels_val))
df_train2.info()
df_train3=df_train2.copy()
# Removing outliers from these variables and then applying min max as before
filters = (
    (df_train3['Money Received'] < 120000)
    & (df_train3['Ticket Price'] < 4000)
)
df_train_out=df_train3[filters]
target_out=target[filters]
metric = df_train_out.loc[:, (df_train_out.dtypes == "int64") | (df_train_out.dtypes == "float64")]
# Normalizing using min max
min_max_scaler = preprocessing.MinMaxScaler()
metric_scaled = min_max_scaler.fit_transform(metric.values)
stand_metric= pd.DataFrame(metric_scaled, columns=metric.columns, index=metric.index)
sns.set(style="white")
# Compute the correlation matrix
corr = stand_metric.corr() #Getting correlation of numerical variables
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool) #Return an array of zeros (Falses) with the same shape and type as a given array
mask[np.triu_indices_from(mask)] = True #The upper-triangle array is now composed by True values
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(20, 12))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True) #Make a diverging palette between two HUSL colors. Return a matplotlib colormap object.
# Draw the heatmap with the mask and correct aspect ratio
#show only correlations bigger than 0.7 in absolute value
sns.heatmap(corr[(corr>=.7) | (corr<=-.7)], mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, ax=ax)
# Layout
plt.subplots_adjust(top=0.95)
plt.suptitle("Correlation matrix", fontsize=20)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
# Fixing the bug of partially cut-off bottom and top cells
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
#no of features
nof_list=np.arange(1,len(stand_metric.columns)+1)
high_score=0
#Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(stand_metric, target_out, test_size=0.3, random_state=0)
    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
# baseline with min-max: 7 features, 0.811, before removing outliers
rfe = RFE(estimator = model, n_features_to_select = 6)
X_rfe = rfe.fit_transform(X = stand_metric, y = target_out)
model = LogisticRegression().fit(X = X_rfe,y = target_out)
selected_features = pd.Series(rfe.support_, index = stand_metric.columns)
selected_features
# min-max without removing outliers does not drop Working hours * Years of Education
#Lasso
def plot_importance(coef, name):
    imp_coef = coef.sort_values()
    plt.figure(figsize=(8, 10))
    imp_coef.plot(kind="barh", color="peru")
    plt.title("Feature importance using " + name + " Model")
    plt.show()
reg = LassoCV()
reg.fit(X=stand_metric, y=target_out)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X = stand_metric,y = target_out))
coef = pd.Series(reg.coef_, index = stand_metric.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
plot_importance(coef,'Lasso')  # min-max without removing outliers drops no variable
Lasso: drop Log 10 of Money Received.
ridge = RidgeClassifierCV().fit(X = stand_metric,y = target_out)
coef_ridge = pd.Series(ridge.coef_[0], index = stand_metric.columns)
def plot_importance(coef, name):
    imp_coef = coef.sort_values()
    plt.figure(figsize=(8, 10))
    imp_coef.plot(kind="barh", color="peru")
    plt.title("Feature importance using " + name + " Model")
    plt.show()
plot_importance(coef_ridge,'RidgeClassifier')
# min-max: Money Received / Years of Education and Ticket Price are the most important
model = LogisticRegression()
forward = SFS(model, k_features=9, forward=True, scoring="accuracy", cv = None) #floating=False
forward.fit(stand_metric, target_out)
forward_table = pd.DataFrame.from_dict(forward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
forward_table
# min-max without removing outliers: the 2nd subset is the most important, 0.8167
forward_table_max = forward_table['avg_score'].max()
forward_table_max  # here the best option is keeping 8 features
forward_table[forward_table['avg_score']==forward_table_max]['feature_names'].values
backward = SFS(model, k_features=1, forward=False, scoring="accuracy", cv = None) #floating=False
backward.fit(stand_metric, target_out)
backward_table = pd.DataFrame.from_dict(backward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
backward_table
backward_table_max = backward_table['avg_score'].max()
backward_table_max
# Money Received loses importance here
# choosing the same number of variables (6), this has a higher score (0.82) than min-max (0.816)
backward_table[backward_table['avg_score']==backward_table_max]['feature_names'].values
stand_metric.drop(columns=['Working Hours per week', 'Log 10 of Money Received', 'Log 10 of Ticket Price'], inplace=True)
all_selected_variables = stand_metric.merge(non_metric_selected, left_index=True, right_index=True, how='left')
non_metric_selected
all_selected_variables
model = LogisticRegression()
Forward:
forward = SFS(model, k_features=15, forward=True, scoring="accuracy", cv = None) #floating=False
forward.fit(all_selected_variables, target_out)
forward_table = pd.DataFrame.from_dict(forward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
forward_table
Considering both the number of variables and the score, 10 is the best.
forward_table.loc[10, 'avg_score']
forward_table.loc[10, 'feature_names']
Backward
backward = SFS(model, k_features=1, forward=False, scoring="accuracy", cv = None) #floating=False
backward.fit(all_selected_variables, target_out)
backward_table = pd.DataFrame.from_dict(backward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
backward_table
backward_table.loc[9, 'avg_score']  # considering both criteria, 9 is the best
backward_table.loc[9, 'feature_names']
Keeping the variables that appear in both the forward and backward selections:
all_selected_variables=all_selected_variables[['x1_Management','x2_Married','x5_1','Years of Education',
'Ticket Price','Age','Working hours * Years of Education']]
non_metric_bf=all_selected_variables[['x1_Management','x2_Married','x5_1']]
stand_metric=all_selected_variables[['Years of Education','Ticket Price','Age','Working hours * Years of Education']]
from scipy.stats import pointbiserialr
print('Point biserial between binary and metric variables:\n')
for i in non_metric_bf.columns:
    for j in stand_metric.columns:
        pb = pointbiserialr(non_metric_bf[i], stand_metric[j])
        if abs(pb[0]) > 0.5:
            print(i, 'and', j, ':', round(pb[0], 3))
all_selected_variables.columns
all_variables_test = pd.concat([df_test, ohc_df_test], axis=1)
test=all_variables_test[['x1_Management','x2_Married','x5_1','Years of Education',
'Ticket Price','Age','Working hours * Years of Education']]
X_train, X_val, y_train, y_val = train_test_split(all_selected_variables,
target_out,
test_size = 0.3,
random_state = 42,
shuffle=True,
stratify=target_out)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix #confusion_matrix to evaluate the accuracy of a classification
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
import time
from sklearn.tree import export_graphviz
import graphviz
import pydotplus
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from numpy import mean
from numpy import std
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerLine2D
from sklearn.svm import SVC
# Functions to be used in all models to assess them
def metrics(y_train, pred_train, y_val, pred_val):
    print('_____________________________________')
    print('                TRAIN                ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, pred_train))
    print(confusion_matrix(y_train, pred_train))  # true/false positives and true/false negatives
    print('_____________________________________')
    print('             VALIDATION              ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, pred_val))
    print(confusion_matrix(y_val, pred_val))
def avg_score(model):
    # apply kfold
    kf = KFold(n_splits=10)
    # create lists to store the results from the different models
    score_train = []
    score_val = []
    timer = []
    n_iter = []
    for train_index, val_index in kf.split(all_selected_variables):
        # get the indexes of the observations assigned to each partition
        X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
        y_train, y_val = target_out.iloc[train_index], target_out.iloc[val_index]
        # time the model fit
        begin = time.perf_counter()
        model.fit(X_train, y_train)
        end = time.perf_counter()
        # mean accuracy on the train partition
        value_train = model.score(X_train, y_train)
        # mean accuracy on the validation partition
        value_val = model.score(X_val, y_val)
        # append the accuracies, the time and the number of iterations to the corresponding lists
        score_train.append(value_train)
        score_val.append(value_val)
        timer.append(end - begin)
        n_iter.append(model.n_iter_)
    # calculate the average and the std for each measure (accuracy, time and number of iterations)
    avg_time = round(np.mean(timer), 3)
    avg_train = round(np.mean(score_train), 3)
    avg_val = round(np.mean(score_val), 3)
    std_time = round(np.std(timer), 2)
    std_train = round(np.std(score_train), 2)
    std_val = round(np.std(score_val), 2)
    avg_iter = round(np.mean(n_iter), 1)
    std_iter = round(np.std(n_iter), 1)
    return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
           str(avg_val) + '+/-' + str(std_val), str(avg_iter) + '+/-' + str(std_iter)
def show_results(df, *args):
    """
    Receive an empty dataframe and the different models and call the function avg_score
    """
    count = 0
    # for each model passed as argument
    for arg in args:
        # obtain the results provided by avg_score
        elapsed, avg_train, avg_val, avg_iter = avg_score(arg)
        # store the results in the right row
        df.iloc[count] = elapsed, avg_train, avg_val, avg_iter
        count += 1
    return df
# For the models that don't have n_iter attribute
def avg_score_1(model):
    # apply kfold
    kf = KFold(n_splits=10)
    # create lists to store the results from the different models
    score_train = []
    score_val = []
    timer = []
    for train_index, val_index in kf.split(all_selected_variables):
        # get the indexes of the observations assigned to each partition
        X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
        y_train, y_val = target_out.iloc[train_index], target_out.iloc[val_index]
        # time the model fit
        begin = time.perf_counter()
        model.fit(X_train, y_train)
        end = time.perf_counter()
        # mean accuracy on the train partition
        value_train = model.score(X_train, y_train)
        # mean accuracy on the validation partition
        value_val = model.score(X_val, y_val)
        # append the accuracies and the time to the corresponding lists
        score_train.append(value_train)
        score_val.append(value_val)
        timer.append(end - begin)
    # calculate the average and the std for each measure (accuracy and time)
    avg_time = round(np.mean(timer), 3)
    avg_train = round(np.mean(score_train), 3)
    avg_val = round(np.mean(score_val), 3)
    std_time = round(np.std(timer), 2)
    std_train = round(np.std(score_train), 2)
    std_val = round(np.std(score_val), 2)
    return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
           str(avg_val) + '+/-' + str(std_val)
def show_results_1(df, *args):
    """
    Receive an empty dataframe and the different models and call the function avg_score_1
    """
    count = 0
    # for each model passed as argument
    for arg in args:
        # obtain the results provided by avg_score_1
        elapsed, avg_train, avg_val = avg_score_1(arg)
        # store the results in the right row
        df.iloc[count] = elapsed, avg_train, avg_val
        count += 1
    return df
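These helper functions reimplement by hand what sklearn's `cross_validate` already provides (fit time plus per-fold train and validation accuracy); a minimal equivalent sketch on synthetic data, assuming only that a 10-fold split is wanted:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for all_selected_variables / target_out
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# return_train_score=True yields the same quantities avg_score collects by hand
cv = cross_validate(LogisticRegression(max_iter=500), X, y,
                    cv=KFold(n_splits=10), return_train_score=True)
avg_train = round(np.mean(cv['train_score']), 3)
avg_val = round(np.mean(cv['test_score']), 3)
avg_time = round(np.mean(cv['fit_time']), 3)
```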
def plot_tree(model_tree):
    dot_data = export_graphviz(model_tree,
                               feature_names=X_train.columns,
                               class_names=["Income lower or equal to avg", "Income higher than avg"],
                               filled=True)
    pydot_graph = pydotplus.graph_from_dot_data(dot_data)
    pydot_graph.set_size('"20,20"')
    return graphviz.Source(pydot_graph.to_string())
#AUC
def calculate_AUC(interval, x_train, x_val, y_train, y_val, parameter, max_depth=None):
    train_results = []
    val_results = []
    for value in interval:
        if parameter == 'max_depth':
            dt = DecisionTreeClassifier(max_depth=value, random_state=42)
        elif parameter == 'max_features':
            dt = DecisionTreeClassifier(max_features=value, max_depth=max_depth, random_state=42)
        elif parameter == 'min_samples_split':
            dt = DecisionTreeClassifier(min_samples_split=value, max_depth=max_depth, random_state=42)
        elif parameter == 'min_samples_leaf':
            dt = DecisionTreeClassifier(min_samples_leaf=value, max_depth=max_depth, random_state=42)
        elif parameter == 'min_weight_fraction_leaf':
            dt = DecisionTreeClassifier(min_weight_fraction_leaf=value, max_depth=max_depth, random_state=42)
        elif parameter == 'min_impurity_decrease':
            dt = DecisionTreeClassifier(min_impurity_decrease=value, max_depth=max_depth, random_state=42)
        dt.fit(x_train, y_train)
        # Add AUC score on the train partition to the train results
        train_pred = dt.predict(x_train)
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
        train_results.append(auc(false_positive_rate, true_positive_rate))
        # Add AUC score on the validation partition to the validation results
        y_pred = dt.predict(x_val)
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_pred)
        val_results.append(auc(false_positive_rate, true_positive_rate))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is ', interval[value_train])
    print('The best validation value is ', interval[value_val])
    line1, = plt.plot(interval, train_results, 'b', label="Train AUC")
    line2, = plt.plot(interval, val_results, 'r', label="Validation AUC")
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("AUC score")
    plt.xlabel(str(parameter))
    plt.show()
Note: parameters in decision trees don't really improve performance; they are meant to control overfitting.
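That effect is easy to reproduce: on noisy synthetic data, a deeper tree widens the gap between train and validation accuracy instead of improving the latter (a sketch, not run on the project data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise so a deep tree can overfit it)
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=1)
X_tr, X_v, y_tr, y_v = train_test_split(X, y, test_size=0.3, random_state=1)

gaps = {}
for depth in (2, 20):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    gaps[depth] = dt.score(X_tr, y_tr) - dt.score(X_v, y_v)  # train minus validation accuracy
```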
dt_entropy = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train)
dt_gini = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Gini','Entropy'])
show_results_1(df,dt_gini, dt_entropy)
dt_random = DecisionTreeClassifier(splitter = 'random').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['best','random'])
show_results_1(df,dt_gini, dt_random)
max_depths = np.linspace(1, 15, 15, endpoint=True)
calculate_AUC(max_depths, X_train, X_val, y_train, y_val, 'max_depth')
dt_depth9 = DecisionTreeClassifier(max_depth = 9).fit(X_train, y_train)
dt_depth5 = DecisionTreeClassifier(max_depth = 5).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['full','depth9','depth5'])
show_results_1(df,dt_gini, dt_depth9,dt_depth5)
# The deeper the tree, the more it overfits! The highest validation score is at depth 9, but depth 5 is nearly identical with less overfitting
max_features = list(range(1,len(X_train.columns)))
calculate_AUC(max_features, X_train, X_val, y_train, y_val,'max_features', 9)
# Probably not needed: the plot shows it would overfit, and we already have an acceptable number of variables
# Besides, it picks the variables to use at random
min_samples_split = list(range(10,600))
calculate_AUC(min_samples_split, X_train, X_val, y_train, y_val,'min_samples_split', 9)
dt_min84 = DecisionTreeClassifier(min_samples_split = 84).fit(X_train, y_train)
dt_min150 = DecisionTreeClassifier(min_samples_split = 150).fit(X_train, y_train)
dt_min400 = DecisionTreeClassifier(min_samples_split = 400).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['dt_min84','dt_min150','dt_min400'])
show_results_1(df, dt_min84, dt_min150, dt_min400)
# Here, the smaller the value, the more overfitting! 400 already gives a balanced result (the best and most generalisable)
min_samples_leaf = list(range(10,600))
calculate_AUC(min_samples_leaf, X_train, X_val, y_train, y_val,'min_samples_leaf', 9)
dt_min_leaf11 = DecisionTreeClassifier(min_samples_leaf = 11).fit(X_train, y_train)
dt_min_leaf170 = DecisionTreeClassifier(min_samples_leaf = 170).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min leaf 11','Min leaf 170'])
show_results_1(df, dt_gini, dt_min_leaf11, dt_min_leaf170)
# harder to reach a conclusion here, and it has the same effect as min_samples_split
# The larger the value -> more underfitting; the smaller (default) -> fully grown tree (overfitting)
# 170 seems to be the best?
# more useful for imbalanced datasets!
min_weight_fraction_leaf = np.linspace(0, 0.3, 250, endpoint=True)
calculate_AUC(min_weight_fraction_leaf, X_train, X_val, y_train, y_val,'min_weight_fraction_leaf', 9)
dt_min_weight_1 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.037).fit(X_train, y_train)
dt_min_weight_2 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.01).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min weight from graph','Min weight small'])
show_results_1(df, dt_gini, dt_min_weight_1, dt_min_weight_2)
# Using a value different from 0.0 made a difference! 0.01 seems to be a good choice!
min_impurity_decrease = np.linspace(0, 0.05, 500, endpoint=True)
calculate_AUC(min_impurity_decrease, X_train, X_val, y_train, y_val,'min_impurity_decrease', 9)
dt_impurity01 = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_train, y_train)
dt_impurity0001 = DecisionTreeClassifier(min_impurity_decrease=0.0001).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Baseline','dt_impurity01','dt_impurity0001'])
show_results_1(df,dt_gini, dt_impurity01,dt_impurity0001)
# The best is min_impurity_decrease=0.0001!
#ccp_alpha
dt_alpha = DecisionTreeClassifier(random_state=42)
path = dt_alpha.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots(figsize = (10,10))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha", fontsize=15)
ax.set_ylabel("total impurity of leaves", fontsize=15)
ax.set_title("Total Impurity vs effective alpha for training set", fontsize=15)
# the function below did not accept ccp_alphas lower than or equal to 0
ccp_alphas=ccp_alphas[ccp_alphas>0]
trees = []
for ccp_alpha in ccp_alphas:
    dt_alpha = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha).fit(X_train, y_train)
    trees.append(dt_alpha)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(trees[-1].tree_.node_count, ccp_alphas[-1]))
trees = trees[:-1]
ccp_alphas = ccp_alphas[:-1]
train_scores = [tree.score(X_train, y_train) for tree in trees]
val_scores = [tree.score(X_val, y_val) for tree in trees]
fig, ax = plt.subplots(figsize = (10,10))
ax.set_xlabel("alpha", fontsize=15)
ax.set_ylabel("accuracy", fontsize=15)
ax.set_title("Accuracy vs alpha for training and validation sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, val_scores, marker='o', label="validation", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(val_scores)
best_model = trees[index_best_model]
print('ccp_alpha of best model: ', ccp_alphas[index_best_model])
print('_____________________________________________________________')
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Validation accuracy of best model: ',best_model.score(X_val, y_val))
dt_t1=DecisionTreeClassifier(min_impurity_decrease=0.0001,max_depth = 9,min_samples_split = 400,min_weight_fraction_leaf = 0.01,random_state=42).fit(X_train, y_train)
dt_t2=DecisionTreeClassifier(max_depth = 9,min_weight_fraction_leaf = 0.01,random_state=42).fit(X_train, y_train)
dt_t3=DecisionTreeClassifier(min_samples_split = 400,min_weight_fraction_leaf = 0.01,random_state=42).fit(X_train, y_train)
dt_t4=DecisionTreeClassifier(max_depth = 9,min_samples_split = 400,min_weight_fraction_leaf = 0.01,random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t1.score(X_train, y_train))
print('Validation accuracy:',dt_t1.score(X_val, y_val))
print('Train accuracy:',dt_t2.score(X_train, y_train))
print('Validation accuracy:',dt_t2.score(X_val, y_val))
print('Train accuracy:',dt_t3.score(X_train, y_train))
print('Validation accuracy:',dt_t3.score(X_val, y_val))
print('Train accuracy:',dt_t4.score(X_train, y_train))
print('Validation accuracy:',dt_t4.score(X_val, y_val))
# Also building the tree given as the best by ccp_alpha:
dt_t5=DecisionTreeClassifier(ccp_alpha=0.000145, random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t5.score(X_train, y_train))
print('Validation accuracy:',dt_t5.score(X_val, y_val))
# does changing the threshold improve the accuracy?
threshold = 0.5
predicted_proba = dt_t5.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
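The single threshold tried above can be generalised to a sweep that keeps the value maximising validation accuracy; a self-contained sketch of the mechanics on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=3)
X_tr, X_v, y_tr, y_v = train_test_split(X, y, test_size=0.3, random_state=3)
clf = DecisionTreeClassifier(max_depth=4, random_state=3).fit(X_tr, y_tr)
proba = clf.predict_proba(X_v)[:, 1]

# Accuracy at each candidate threshold; keep the best one
thresholds = np.arange(0.1, 0.91, 0.05)
accs = [accuracy_score(y_v, (proba >= t).astype(int)) for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(accs))])
best_acc = max(accs)
```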
# To build the ROC curve
prob_model1 = dt_t1.predict_proba(X_val)
prob_model2 = dt_t2.predict_proba(X_val)
prob_model3 = dt_t3.predict_proba(X_val)
prob_model4 = dt_t4.predict_proba(X_val)
prob_model5 = dt_t5.predict_proba(X_val)
fpr_1, tpr_1, thresholds_1 = roc_curve(y_val, prob_model1[:, 1])
fpr_2, tpr_2, thresholds_2 = roc_curve(y_val, prob_model2[:, 1])
fpr_3, tpr_3, thresholds_3 = roc_curve(y_val, prob_model3[:, 1])
fpr_4, tpr_4, thresholds_4 = roc_curve(y_val, prob_model4[:, 1])
fpr_5, tpr_5, thresholds_5 = roc_curve(y_val, prob_model5[:, 1])
plt.plot(fpr_1, tpr_1, label="ROC Curve model 1")
plt.plot(fpr_2, tpr_2, label="ROC Curve model 2")
plt.plot(fpr_3, tpr_3, label="ROC Curve model 3")
plt.plot(fpr_4, tpr_4, label="ROC Curve model 4")
plt.plot(fpr_5, tpr_5, label="ROC Curve model 5")
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
# the curves came out very similar, hard to tell which is best: but the best seems to be the purple one, number 5, with ccp_alpha changed
# and that is indeed the best one on validation
The best is decision tree 5, with ccp_alpha as the only changed parameter.
labels_train = dt_t5.predict(X_train)
labels_val = dt_t5.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Just to check the complexity of the tree
print('The "best" tree has a depth of ' + str(dt_t5.get_depth()) + ', ' + str(dt_t5.tree_.node_count) +
' nodes and a total of ' + str(dt_t5.get_n_leaves()) + ' leaves.')
ensemble_clfs = [
    ("RandomForestClassifier, max_features='auto'",
     RandomForestClassifier(oob_score=True, max_features='auto', random_state=42)),
    ("RandomForestClassifier, max_features='log2'",
     RandomForestClassifier(oob_score=True, max_features='log2', random_state=42)),
    ("RandomForestClassifier, max_features=6",
     RandomForestClassifier(oob_score=True, max_features=6, random_state=42)),
    ("RandomForestClassifier, max_features=None",
     RandomForestClassifier(oob_score=True, max_features=None, random_state=42))
]
from collections import OrderedDict
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 175 #225
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X_train, y_train)
        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
# Creating and fitting the models
rf_1 = RandomForestClassifier(n_estimators=85, max_depth=9, random_state = 42).fit(X_train, y_train)
rf_2 = RandomForestClassifier(n_estimators=85, max_depth=9, max_features = 'log2', random_state = 42).fit(X_train, y_train)
rf_3 = RandomForestClassifier(n_estimators=85, max_depth=9, min_samples_split=400, random_state = 42).fit(X_train, y_train)
rf_4 = RandomForestClassifier(min_samples_split = 400, min_weight_fraction_leaf = 0.01,random_state=42).fit(X_train, y_train)
rf_5 = RandomForestClassifier(ccp_alpha=0.000145, random_state=42).fit(X_train, y_train)
rf_6 = RandomForestClassifier(max_depth = 9, min_weight_fraction_leaf = 0.01, random_state=42).fit(X_train, y_train)
rf_7 = RandomForestClassifier(n_estimators=85, max_depth=5, random_state = 42).fit(X_train, y_train)
rf_8 = RandomForestClassifier(n_estimators=85, max_depth=5, max_features = 6, random_state = 42).fit(X_train, y_train)
print('rf_1 train accuracy:', rf_1.score(X_train, y_train))
print('rf_1 validation accuracy:', rf_1.score(X_val, y_val))
print('rf_2 train accuracy:', rf_2.score(X_train, y_train))
print('rf_2 validation accuracy:', rf_2.score(X_val, y_val))
print('rf_3 train accuracy:', rf_3.score(X_train, y_train))
print('rf_3 validation accuracy:', rf_3.score(X_val, y_val))
print('rf_4 train accuracy:', rf_4.score(X_train, y_train))
print('rf_4 validation accuracy:', rf_4.score(X_val, y_val))
print('rf_5 train accuracy:', rf_5.score(X_train, y_train))
print('rf_5 validation accuracy:', rf_5.score(X_val, y_val))
print('rf_6 train accuracy:', rf_6.score(X_train, y_train))
print('rf_6 validation accuracy:', rf_6.score(X_val, y_val))
print('rf_7 train accuracy:', rf_7.score(X_train, y_train))
print('rf_7 validation accuracy:', rf_7.score(X_val, y_val))
print('rf_8 train accuracy:', rf_8.score(X_train, y_train))
print('rf_8 validation accuracy:', rf_8.score(X_val, y_val))
models = ['rf_1', 'rf_2', 'rf_3','rf_4','rf_5', 'rf_6', 'rf_7', 'rf_8']
accuracies = [rf_1.score(X_val, y_val), rf_2.score(X_val, y_val), rf_3.score(X_val, y_val), rf_4.score(X_val, y_val),
rf_5.score(X_val, y_val), rf_6.score(X_val, y_val), rf_7.score(X_val, y_val), rf_8.score(X_val, y_val)]
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
plt.bar(data[0], data[1], color='peru')
plt.ylim(0.84, 0.855)
plt.show()
labels_train = rf_2.predict(X_train)
labels_val = rf_2.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# predict values for X_test, e.g., for the citizen in X_test[0] we are predicting y[0] -> 0
#changing the threshold does not seem to improve the accuracy of the best random forest!
threshold = 0.5
predicted_proba = rf_2.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
#importing and defining the model
log_model = LogisticRegression(random_state=42)
log_model.fit(X_train,y_train) #fit model to our train data
labels_train = log_model.predict(X_train)
log_model.score(X_train, y_train)
# Predict class labels for samples in X
labels_val = log_model.predict(X_val)
log_model.score(X_val, y_val)
# predict values for X_test, e.g., for the citizen in X_test[0] we are predicting y[0] -> 0
pred_prob = log_model.predict_proba(X_val)
pred_prob
# the cutoff is usually 0.5, but sometimes it is preferable to consider a lower value
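To actually apply a lower cutoff, the positive-class probability column can be thresholded manually. A sketch on synthetic data (the names `clf`, `proba`, `preds` are illustrative, not the project variables):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=42)
clf = LogisticRegression(random_state=42).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of class 1
cutoff = 0.4  # below the default 0.5: more observations are classified as positive
preds = (proba >= cutoff).astype(int)
print(preds.sum(), "positives at cutoff", cutoff)
```

Lowering the cutoff can never decrease the number of predicted positives, which is why it mainly trades precision for recall.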
X_train.columns
log_model.coef_
# since we don't have residuals, we cannot use OLS; it does not apply to logistic regression
# with these values we can only say that a positive coefficient bends the curve upwards, a negative one downwards
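Since `log_model.coef_` on its own is just an unlabelled array, a small sketch (on synthetic data, with hypothetical feature names) shows how to pair each coefficient with its column for easier reading:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
log_model = LogisticRegression(random_state=42).fit(X, y)
# Positive coefficient -> pushes P(y=1) up; negative -> pushes it down
coefs = pd.Series(log_model.coef_[0], index=X.columns).sort_values()
print(coefs)
```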
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, log_model)
metrics(y_train, labels_train, y_val, labels_val)
#modelNB = GaussianNB() # train score: 0.823 validation score: 0.814
#modelNB = GaussianNB(var_smoothing=0.0001) #train score: 0.823 validation score: 0.815
modelNB = GaussianNB(var_smoothing=0.001) # train score: 0.823 validation score: 0.815
modelNB.fit(X = X_train, y = y_train)
labels_train = modelNB.predict(X_train)
labels_val = modelNB.predict(X_val)
modelNB.predict_proba(X_val)
print("train score:", modelNB.score(X_train, y_train))
print("validation score:",modelNB.score(X_val, y_val))
# To check the class imbalance, and the mean and variance for each class
print(modelNB.class_prior_) #prob 0, prob 1
print(modelNB.class_count_)#n 0, n 1
# modelNB.theta_
# modelNB.sigma_
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelNB)
metrics(y_train, labels_train, y_val, labels_val)
model = MLPClassifier(random_state=42)
model.fit(X_train, y_train)
labels_train = model.predict(X_train)
labels_val = model.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
model_1 = MLPClassifier(hidden_layer_sizes=(1),random_state=42)
model_2 = MLPClassifier(hidden_layer_sizes=(3),random_state=42)
model_3 = MLPClassifier(hidden_layer_sizes=(9),random_state=42)
model_4 = MLPClassifier(hidden_layer_sizes=(3, 3),random_state=42)
model_5 = MLPClassifier(hidden_layer_sizes=(5, 5),random_state=42)
model_6 = MLPClassifier(hidden_layer_sizes=(3, 3, 3),random_state=42) #3 layers each one with 3 units
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_1','M_2','M_3', 'M_4','M_5','M_6'])
show_results(df, model_1, model_2, model_3, model_4, model_5, model_6)
model_7 = MLPClassifier(hidden_layer_sizes=(4, 4),random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_7'])
show_results(df, model_7)
model_logistic = MLPClassifier(activation = 'logistic',random_state=42)
model_tanh = MLPClassifier(activation = 'tanh',random_state=42)
model_relu=MLPClassifier(activation = 'relu',random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['logistic','tanh','relu'])
show_results(df, model_logistic, model_tanh,model_relu)
model_lbfgs = MLPClassifier(solver = 'lbfgs',random_state=4) #low dim and sparse data
model_sgd = MLPClassifier(solver = 'sgd',random_state=4) #accuracy > processing time
model_adam = MLPClassifier(solver = 'adam',random_state=4) # big dataset but might fail to converge
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['lbfgs','sgd','adam'])
show_results(df, model_lbfgs, model_sgd, model_adam)
Adam is the best solver.
model_constant = MLPClassifier(solver = 'sgd', learning_rate = 'constant',random_state=42)
model_invscaling = MLPClassifier(solver = 'sgd', learning_rate = 'invscaling',random_state=42)
model_adaptive = MLPClassifier(solver = 'sgd', learning_rate = 'adaptive',random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['constant','invscaling','adaptive'])
show_results(df, model_constant, model_invscaling, model_adaptive)
Constant is the best learning-rate schedule (fewer iterations and less time than adaptive).
model_a = MLPClassifier(solver = 'adam', learning_rate_init = 0.5,random_state=42) # the larger it is, the faster the model learns
model_b = MLPClassifier(solver = 'adam', learning_rate_init = 0.1,random_state=42)
model_c = MLPClassifier(solver = 'adam', learning_rate_init = 0.01,random_state=42) # if too small, it may get stuck in a suboptimal solution and never converge
model_d = MLPClassifier(solver = 'adam', learning_rate_init = 0.001,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_a','M_b','M_c', "M_d"])
show_results(df, model_a, model_b, model_c, model_d)
The best initial learning rate is 0.01.
model_e = MLPClassifier(solver = 'adam', learning_rate_init = 0.005,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ["M_e"])
show_results(df, model_e)
model_batch20 = MLPClassifier(solver = 'sgd', batch_size = 20,random_state=42)
model_batch50 = MLPClassifier(solver = 'sgd', batch_size = 50,random_state=42)
model_batch100 = MLPClassifier(solver = 'sgd', batch_size = 100,random_state=42)
model_batch200 = MLPClassifier(solver = 'sgd', batch_size = 200,random_state=42)
model_batch500 = MLPClassifier(solver = 'sgd', batch_size = 500,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['batch 20','batch 50','batch 100', 'batch 200', 'batch 500'])
show_results(df, model_batch20, model_batch50, model_batch100, model_batch200, model_batch500)
The best batch size is 20.
model_maxiter_50 = MLPClassifier(max_iter = 50,random_state=42)
model_maxiter_100 = MLPClassifier(max_iter = 100,random_state=42)
model_maxiter_200 = MLPClassifier(max_iter = 200,random_state=42)
model_maxiter_300 = MLPClassifier(max_iter = 300,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 50','max iter 100','max iter 200', 'max iter 300'])
show_results(df, model_maxiter_50, model_maxiter_100, model_maxiter_200, model_maxiter_300)
model_maxiter_150 = MLPClassifier(max_iter = 150,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 150'])
show_results(df, model_maxiter_150)
model_all=MLPClassifier(hidden_layer_sizes=(9),activation = 'logistic',solver = 'adam',learning_rate_init = 0.1,batch_size = 50,random_state=4)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model_all)
# parameter_space = {
#     'hidden_layer_sizes': [(9), (3,3,3)],
#     'activation': ['relu'],
#     'solver': ['adam'],
#     'learning_rate': ['adaptive'],
#     'learning_rate_init': [(0.01)],
#     'batch_size': list(np.arange(10, 40, 10)),
#     'max_iter': list(np.arange(100, 400, 50)),
# }
# clf = GridSearchCV(model, parameter_space, verbose=1, n_jobs=-1)
# clf.fit(X_train , y_train)
# clf.best_params_
model_grid=MLPClassifier(activation= 'relu', batch_size= 30, hidden_layer_sizes=(9), learning_rate='adaptive',
learning_rate_init= 0.01, max_iter= 100, solver= 'adam',random_state=4)
model_grid.fit(X_train, y_train)
labels_train = model_grid.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = model_grid.predict(X_val)
accuracy_score(y_val, labels_val)
metrics(y_train, labels_train, y_val, labels_val)
The number K is typically chosen as the square root of the total number of points in the training data set. In this case, N is 15 680, so K ≈ 125.
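This rule of thumb can be written as a one-line helper; a sketch (the odd-forcing tweak, which avoids tie votes in binary classification, is our addition, not part of the original heuristic):

```python
import math

def heuristic_k(n_samples):
    """Rule-of-thumb starting k for KNN: sqrt(n), rounded, forced odd to avoid tied votes."""
    k = round(math.sqrt(n_samples))
    return k if k % 2 == 1 else k + 1

print(heuristic_k(15680))  # 125
```

The heuristic only gives a starting point; the grid search below still explores a range around it.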
# try K=50 through K=150 and record testing accuracy
k_range = range(50, 150)
scores = []
# We use a loop through the range
# We append the scores in the list
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_val)
    scores.append(accuracy_score(y_val, y_pred))
# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Validation Accuracy')
modelKNN1 = KNeighborsClassifier().fit(X = X_train, y = y_train)
print("train score:", modelKNN1.score(X_train, y_train))
print("validation score:",modelKNN1.score(X_val, y_val))
modelKNN2 = KNeighborsClassifier(n_neighbors=70).fit(X = X_train, y = y_train)
print("train score:", modelKNN2.score(X_train, y_train))
print("validation score:",modelKNN2.score(X_val, y_val))
#from the available algorithms (excluding the default), this was the best one
modelKNN3 = KNeighborsClassifier(n_neighbors=70, algorithm='ball_tree').fit(X = X_train, y = y_train)
print("train score:", modelKNN3.score(X_train, y_train))
print("validation score:",modelKNN3.score(X_val, y_val))
modelKNN4 = KNeighborsClassifier(n_neighbors=70, p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN4.score(X_train, y_train))
print("validation score:",modelKNN4.score(X_val, y_val))
modelKNN5 = KNeighborsClassifier(n_neighbors=70, weights='distance').fit(X = X_train, y = y_train)
print("train score:", modelKNN5.score(X_train, y_train))
print("validation score:",modelKNN5.score(X_val, y_val))
modelKNN6 = KNeighborsClassifier(n_neighbors=70, algorithm='ball_tree', p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN6.score(X_train, y_train))
print("validation score:",modelKNN6.score(X_val, y_val))
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['modelKNN1', 'modelKNN2', 'modelKNN3', 'modelKNN4', 'modelKNN5', 'modelKNN6'])
show_results_1(df, modelKNN1, modelKNN2, modelKNN3, modelKNN4, modelKNN5, modelKNN6)
# Model with best accuracy
labels_train = modelKNN6.predict(X_train)
labels_val = modelKNN6.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Creating and fitting model
pac_basic = PassiveAggressiveClassifier(random_state=42)
pac_basic.fit(X_train, y_train)
pac_1 = PassiveAggressiveClassifier(C=0.001, fit_intercept=True, tol=1e-3, loss='squared_hinge',random_state=42)
pac_1.fit(X_train, y_train)
pac_2 = PassiveAggressiveClassifier(C=0.001, tol=1e-3, loss='squared_hinge',random_state=42)
pac_2.fit(X_train, y_train)
pac_3 = PassiveAggressiveClassifier(C=0.001, tol=1e-3, random_state=42)
pac_3.fit(X_train, y_train)
# Making prediction on the validation set
val_pred_basic = pac_basic.predict(X_val)
val_pred_1 = pac_1.predict(X_val)
val_pred_2 = pac_2.predict(X_val)
val_pred_3 = pac_3.predict(X_val)
df = pd.DataFrame(columns = ['Time','Train','Validation','Iterations'], index = ['PAC_Basic','PAC_1','PAC_2','PAC_3'])
show_results(df, pac_basic, pac_1, pac_2, pac_3)
labels_train = pac_2.predict(X_train)
labels_val = pac_2.predict(X_val)
print('train accuracy:',accuracy_score(y_train, labels_train))
print('validation accuracy:',accuracy_score(y_val, labels_val))
metrics(y_train, labels_train, y_val, labels_val)
modelLDA = LinearDiscriminantAnalysis()
modelLDA.fit(X = X_train, y = y_train)
labels_train = modelLDA.predict(X_train)
labels_val = modelLDA.predict(X_val)
modelLDA.predict_proba(X_val)
print("train score:", modelLDA.score(X_train, y_train))
print("validation score:",modelLDA.score(X_val, y_val))
# # define grid
# grid = dict()
# grid['solver'] = ['svd', 'lsqr', 'eigen']
# # define search
# search = GridSearchCV(modelLDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
# from numpy import arange
# grid = dict()
# grid['shrinkage'] = arange(0, 1, 0.01)
# grid['solver']=['svd', 'lsqr', 'eigen'] #svd cannot be tested with shrinkage
# # define search
# search = GridSearchCV(modelLDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelLDA_final = LinearDiscriminantAnalysis(solver='svd')
modelLDA_final.fit(X = X_train, y = y_train)
labels_train = modelLDA_final.predict(X_train)
labels_val = modelLDA_final.predict(X_val)
print("train score:", modelLDA_final.score(X_train, y_train))
print("validation score:",modelLDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
modelQDA = QuadraticDiscriminantAnalysis()
modelQDA.fit(X = X_train, y = y_train)
labels_train = modelQDA.predict(X_train)
labels_val = modelQDA.predict(X_val)
modelQDA.predict_proba(X_val)
print("train score:", modelQDA.score(X_train, y_train))
print("validation score:",modelQDA.score(X_val, y_val))
# # define grid
# grid = dict()
# grid['reg_param'] = arange(0, 1, 0.01)
# # define search
# search = GridSearchCV(modelQDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelQDA_final = QuadraticDiscriminantAnalysis(reg_param=0.04)
modelQDA_final.fit(X = X_train, y = y_train)
labels_train = modelQDA_final.predict(X_train)
labels_val = modelQDA_final.predict(X_val)
print("train score:", modelQDA_final.score(X_train, y_train))
print("validation score:",modelQDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
# # try C=250 through C=1250 and record validation accuracy
# C_range = range(250, 1250)
# scores = []
# # We use a loop through the range
# # We append the scores in the list
# for c in C_range:
#     svm = SVC(C=c)
#     svm.fit(X_train, y_train)
#     y_pred = svm.predict(X_val)
#     scores.append(accuracy_score(y_val, y_pred))
# # plot the relationship between C and testing accuracy
# plt.plot(C_range, scores)
# plt.xlabel('Value of C for the SVM')
# plt.ylabel('Validation Accuracy')
modelSVM_basic = SVC().fit(X_train, y_train)
modelSVM_1 = SVC(kernel='linear').fit(X_train, y_train)
modelSVM_2 = SVC(C=750).fit(X_train, y_train)
modelSVM_3 = SVC(kernel = 'poly').fit(X_train, y_train)
modelSVM_4 = SVC(C=750, kernel = 'poly').fit(X_train, y_train)
modelSVM_5 = SVC(C=750, kernel = 'linear').fit(X_train, y_train)
modelSVM_6 = SVC(C=750, shrinking=False).fit(X_train, y_train)
modelSVM_7 = SVC(C=750, tol=1e-3).fit(X_train, y_train)
accuracies = [modelSVM_basic.score(X_val, y_val), modelSVM_1.score(X_val, y_val),
              modelSVM_2.score(X_val, y_val), modelSVM_3.score(X_val, y_val),
              modelSVM_4.score(X_val, y_val), modelSVM_5.score(X_val, y_val),
              modelSVM_6.score(X_val, y_val), modelSVM_7.score(X_val, y_val)]
models = ['modelSVM_basic', 'modelSVM_1', 'modelSVM_2', 'modelSVM_3',
          'modelSVM_4', 'modelSVM_5', 'modelSVM_6', 'modelSVM_7']
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
plt.bar(data[0], data[1], color='peru')
plt.xticks(rotation=90)
plt.ylim(0.80,0.86)
plt.show()
# highest accuracy from the SVMs
modelSVM_basic.score(X_val, y_val)
modelSVM_basic.score(X_train, y_train)
pred_train_svm = modelSVM_basic.predict(X_train)
pred_val_svm = modelSVM_basic.predict(X_val)
metrics(y_train, pred_train_svm, y_val, pred_val_svm)
def calculate_f1(interval, x_train, x_val, y_train, y_val, parameter):
    train_results = []
    val_results = []
    for value in interval:
        if parameter == 'Number of estimators':
            dt = AdaBoostClassifier(n_estimators=value, random_state=5)
        elif parameter == 'Learning Rate':
            dt = AdaBoostClassifier(learning_rate=value, random_state=5)
        dt.fit(x_train, y_train)
        train_results.append(f1_score(y_train, dt.predict(x_train)))
        val_results.append(f1_score(y_val, dt.predict(x_val)))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is', interval[value_train])
    print('The best val value is', interval[value_val])
    fig = plt.figure(figsize=(16, 10))
    # pass the colour once, via color= (a hex format string alongside color= would be redundant)
    line1, = plt.plot(interval, train_results, label="Train F1", linewidth=3, color='peru')
    line2, = plt.plot(interval, val_results, label="Val F1", linewidth=3, color='b')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("F1 score")
    plt.xlabel(str(parameter))
    plt.show()
num_estimators = list(range(1,100))
calculate_f1(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(range(1,25))
calculate_f1(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(range(10,250))
calculate_f1(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
learning_rate = list(np.arange(0.01, 2, 0.05))
calculate_f1(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
# AdaBoost = AdaBoostClassifier()
# AdaBoost_parameters = {'base_estimator' : [None, modelNB, modelQDA_final, modelLDA_final],
#                        'n_estimators' : list(range(5,20)),
#                        'learning_rate' : np.arange(0.3, 0.75, 0.05),
#                        'algorithm' : ['SAMME', 'SAMME.R']}
# AdaBoost_grid = GridSearchCV(estimator=AdaBoost, param_grid=AdaBoost_parameters,
#                              scoring='accuracy', verbose=1, n_jobs=-1)
# AdaBoost_grid.fit(X_train , y_train)
# AdaBoost_grid.best_params_
modelAdaBoost = AdaBoostClassifier(base_estimator=None, n_estimators=18, learning_rate=0.4, algorithm='SAMME.R', random_state=5)
modelAdaBoost.fit(X_train,y_train)
labels_train = modelAdaBoost.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelAdaBoost.predict(X_val)
accuracy_score(y_val, labels_val)
# AdaBoost = AdaBoostClassifier()
# AdaBoost_parameters = {'base_estimator' : [None, modelNB, modelQDA_final, modelLDA_final],
#                        'n_estimators' : list(range(205,220)),
#                        'learning_rate' : np.arange(0.3, 0.75, 0.05),
#                        'algorithm' : ['SAMME', 'SAMME.R']}
# AdaBoost_grid = GridSearchCV(estimator=AdaBoost, param_grid=AdaBoost_parameters,
#                              scoring='accuracy', verbose=1, n_jobs=-1)
# AdaBoost_grid.fit(X_train , y_train)
# AdaBoost_grid.best_params_
modelAdaBoost = AdaBoostClassifier(base_estimator=None, n_estimators=214, learning_rate=0.65, algorithm='SAMME.R', random_state=5)
modelAdaBoost.fit(X_train,y_train)
labels_train = modelAdaBoost.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelAdaBoost.predict(X_val)
accuracy_score(y_val, labels_val)
metrics(y_train, labels_train, y_val, labels_val)
def calculate_f1_2(interval, x_train, x_val, y_train, y_val, parameter):
    train_results = []
    val_results = []
    for value in interval:
        if parameter == 'Number of estimators':
            dt = GradientBoostingClassifier(n_estimators=value, random_state=5)
        elif parameter == 'Learning Rate':
            dt = GradientBoostingClassifier(learning_rate=value, random_state=5)
        dt.fit(x_train, y_train)
        train_results.append(f1_score(y_train, dt.predict(x_train)))
        val_results.append(f1_score(y_val, dt.predict(x_val)))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is', interval[value_train])
    print('The best val value is', interval[value_val])
    fig = plt.figure(figsize=(16, 10))
    # pass the colour once, via color= (a hex format string alongside color= would be redundant)
    line1, = plt.plot(interval, train_results, label="Train F1", linewidth=3, color='peru')
    line2, = plt.plot(interval, val_results, label="Val F1", linewidth=3, color='b')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("F1 score")
    plt.xlabel(str(parameter))
    plt.show()
learning_rate = list(np.arange(0.05, 1.5, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
learning_rate = list(np.arange(0.05, 1, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
learning_rate = list(np.arange(1, 1.8, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
num_estimators = list(np.arange(1, 200, 10))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(np.arange(150, 400, 10))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(np.arange(300, 500, 15))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
modelGBauto = GradientBoostingClassifier(max_features='auto', random_state=5)
modelGBlog = GradientBoostingClassifier(max_features='log2',random_state=5)
modelGBsqrt = GradientBoostingClassifier(max_features='sqrt',random_state=5)
modelGBnone = GradientBoostingClassifier(max_features=None,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Auto','Log2','Sqrt','None/Raw'])
show_results_1(df, modelGBauto, modelGBlog, modelGBsqrt, modelGBnone)
modelGBdev = GradientBoostingClassifier(loss='deviance', random_state=5)
modelGBexp = GradientBoostingClassifier(loss='exponential',random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['deviance','exponential'])
show_results_1(df, modelGBdev, modelGBexp)
modelGB2 = GradientBoostingClassifier(max_depth=2, random_state=5)
modelGB3 = GradientBoostingClassifier(max_depth=3,random_state=5)
modelGB10 = GradientBoostingClassifier(max_depth=10,random_state=5)
modelGB30 = GradientBoostingClassifier(max_depth=30,random_state=5)
modelGB50 = GradientBoostingClassifier(max_depth=50,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['model2','model3','model10','model30','model50'])
show_results_1(df, modelGB2, modelGB3,modelGB10,modelGB30,modelGB50)
# GB_clf = GradientBoostingClassifier()
# GB_parameters = {'loss' : ['deviance', 'exponential'],
#                  'learning_rate' : np.arange(1.5, 1.8, 0.05),
#                  'n_estimators' : np.arange(300, 350, 5),
#                  'max_depth' : np.arange(2, 5, 1),
#                  'max_features' : ['auto', None]
#                  }
# GB_grid = GridSearchCV(estimator=GB_clf, param_grid=GB_parameters, scoring='accuracy', verbose=1, n_jobs=-1)
# GB_grid.fit(X_train , y_train)
# GB_grid.best_params_
modelGB = GradientBoostingClassifier(learning_rate=1.5, loss='exponential', max_depth=2, max_features='auto',
n_estimators=320, random_state=5)
modelGB.fit(X_train, y_train)
labels_train = modelGB.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelGB.predict(X_val)
accuracy_score(y_val, labels_val)
metrics(y_train, labels_train, y_val, labels_val)
df_train2.info()
metric = df_train2.loc[:, (df_train2.dtypes == "int64") | (df_train2.dtypes == "float64")]
# Normalizing using RobustScaler instead of MinMax
robust = RobustScaler().fit(metric)
robust_metric= robust.transform(metric)
stand_metric= pd.DataFrame(robust_metric, columns=metric.columns, index=metric.index)
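The difference between the two scalers shows up when outliers are present: MinMax is pinned to the extremes, while RobustScaler centres on the median and scales by the IQR. A toy illustration (synthetic values, not project data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

data = np.array([[1.0], [2.0], [3.0], [100.0]])  # one extreme outlier
mm = MinMaxScaler().fit_transform(data).ravel()
rb = RobustScaler().fit_transform(data).ravel()
print("MinMax:", mm)   # the outlier pins the range; the bulk is squeezed near 0
print("Robust:", rb)   # centred on the median, scaled by the interquartile range
```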
sns.set(style="white")
# Compute the correlation matrix
corr = stand_metric.corr() #Getting correlation of numerical variables
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool) # Return an array of zeros (Falses) with the same shape as the correlation matrix
mask[np.triu_indices_from(mask)] = True #The upper-triangle array is now composed by True values
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(20, 12))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True) #Make a diverging palette between two HUSL colors. Return a matplotlib colormap object.
# Draw the heatmap with the mask and correct aspect ratio
#show only correlations bigger than 0.7 in absolute value
sns.heatmap(corr[(corr>=.7) | (corr<=-.7)], mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, ax=ax)
# Layout
plt.subplots_adjust(top=0.95)
plt.suptitle("Correlation matrix", fontsize=20)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
# Fixing the bug of partially cut-off bottom and top cells
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
#correlation between Money Received and Log 10 of Money Received
round(corr['Money Received']['Log 10 of Money Received'], 3)
#no of features
nof_list=np.arange(1,len(stand_metric.columns)+1)
high_score=0
#Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(stand_metric, target, test_size=0.3, random_state=0)
    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
# baseline with minmax: 7 features
rfe = RFE(estimator = model, n_features_to_select = 7)
X_rfe = rfe.fit_transform(X = stand_metric, y = target)
model = LogisticRegression().fit(X = X_rfe,y = target)
selected_features = pd.Series(rfe.support_, index = stand_metric.columns)
selected_features
# minmax only removes the log variables
#Lasso
def plot_importance(coef, name):
    imp_coef = coef.sort_values()
    plt.figure(figsize=(8, 10))
    imp_coef.plot(kind="barh", color="peru")
    plt.title("Feature importance using " + name + " Model")
    plt.show()
reg = LassoCV()
reg.fit(X=stand_metric, y=target)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X = stand_metric,y = target))
coef = pd.Series(reg.coef_, index = stand_metric.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
plot_importance(coef,'Lasso') #minmax chose all
ridge = RidgeClassifierCV().fit(X = stand_metric,y = target)
coef_ridge = pd.Series(ridge.coef_[0], index = stand_metric.columns)
plot_importance(coef_ridge,'RidgeClassifier')
#minmax: money/yearseduc more important and ticket price
model = LogisticRegression()
forward = SFS(model, k_features=9, forward=True, scoring="accuracy", cv = None) #floating=False
forward.fit(stand_metric, target)
forward_table = pd.DataFrame.from_dict(forward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
forward_table
# minmax: the best is the 2nd, with 0.8167
forward_table_max = forward_table['avg_score'].max()
forward_table_max
forward_table[forward_table['avg_score']==forward_table_max]['feature_names'].values
backward = SFS(model, k_features=1, forward=False, scoring="accuracy", cv = None) #floating=False
backward.fit(stand_metric, target)
backward_table = pd.DataFrame.from_dict(backward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
backward_table
backward_table_max = backward_table['avg_score'].max()
backward_table_max
# Money Received loses importance here
# choosing the same number of variables (6), this has a higher score (0.82) than minmax (0.816)
backward_table[backward_table['avg_score']==backward_table_max]['feature_names'].values
stand_metric.drop(columns=['Working Hours per week', 'Money Received', 'Ticket Price'], inplace=True)
all_selected_variables = pd.concat([non_metric_selected, stand_metric], axis=1)
all_selected_variables
model = LogisticRegression()
Forward:
forward = SFS(model, k_features=16, forward=True, scoring="accuracy", cv = None) #floating=False
forward.fit(all_selected_variables, target)
forward_table = pd.DataFrame.from_dict(forward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
forward_table
8 is the best, considering both the number of variables and the score.
forward_table.loc[8, 'avg_score']
forward_table.loc[8, 'feature_names']
Backward:
backward = SFS(model, k_features=1, forward=False, scoring="accuracy", cv = None) #floating=False
backward.fit(all_selected_variables, target)
backward_table = pd.DataFrame.from_dict(backward.get_metric_dict()).T.drop(columns=['cv_scores', 'ci_bound', 'std_dev', 'std_err'])
backward_table
backward_table.loc[9, 'avg_score'] # 9 is the best, considering both criteria
backward_table.loc[9, 'feature_names']
Keeping the variables that appear in both the forward and backward selections:
non_metric_bf = non_metric_selected.drop(columns=['Higher Education', 'Male','x2_Single', 'x3_Bachelors', 'x3_Masters', 'x5_1', 'x5_3'])
all_selected_variables.drop(columns=['Higher Education', 'Male','x2_Single', 'x3_Bachelors', 'x3_Masters', 'x5_1', 'x5_3'], inplace=True)
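The "appears in both selections" rule can be computed rather than eyeballed. A sketch with hypothetical feature tuples standing in for the `feature_names` entries of the two SFS tables above:

```python
# Hypothetical tuples, standing in for forward_table / backward_table 'feature_names'
forward_feats = ('Age', 'Years of Education', 'x2_Married', 'Log 10 of Ticket Price')
backward_feats = ('Age', 'x2_Married', 'Money / YE', 'Log 10 of Ticket Price')
# Set intersection keeps only the features selected by BOTH searches
common = sorted(set(forward_feats) & set(backward_feats))
print(common)
```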
from scipy.stats import pointbiserialr
print('Point biserial between binary and metric variables:\n')
for i in non_metric_bf.columns:
    for j in stand_metric.columns:
        pb = pointbiserialr(non_metric_bf[i], stand_metric[j])
        if abs(pb[0]) > 0.5:
            print(i, 'and', j, ':', round(pb[0], 3))
all_selected_variables.columns
all_variables_test = pd.concat([df_test, ohc_df_test], axis=1)
test = all_variables_test[['Age', 'Years of Education', 'Working hours * Years of Education', 'x1_Management',
                           'x2_Married', 'x5_5', 'Log 10 of Ticket Price', 'x1_Professor',
                           'Log 10 of Money Received', 'Money / YE']]
X_train, X_val, y_train, y_val = train_test_split(all_selected_variables,
                                                  target,
                                                  test_size=0.3,
                                                  random_state=42,
                                                  shuffle=True,
                                                  stratify=target)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix #confusion_matrix to evaluate the accuracy of a classification
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
import time
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import graphviz
import pydotplus
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from numpy import mean
from numpy import std
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerLine2D
from sklearn.svm import SVC
# Functions to be used in all models to assess them
def metrics(y_train, pred_train , y_val, pred_val):
print('_____________________________________')
print(' TRAIN ')
print('-----------------------------------------------------------------------------------------------------------')
print(classification_report(y_train, pred_train))
print(confusion_matrix(y_train, pred_train)) #true neg and true pos, false positives and false neg
print('_____________________________________')
print(' VALIDATION ')
print('-----------------------------------------------------------------------------------------------------------')
print(classification_report(y_val, pred_val))
print(confusion_matrix(y_val, pred_val))
def avg_score(model):
# apply kfold
kf = KFold(n_splits=10)
# create lists to store the results from the different models
score_train = []
score_val = []
timer = []
n_iter = []
for train_index, val_index in kf.split(all_selected_variables):
# get the indexes of the observations assigned for each partition
X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
y_train, y_val = target.iloc[train_index], target.iloc[val_index]
# start counting time
begin = time.perf_counter()
# fit the model to the data
model.fit(X_train, y_train)
# finish counting time
end = time.perf_counter()
# check the mean accuracy for the train
value_train = model.score(X_train, y_train)
# check the mean accuracy for the test
value_val = model.score(X_val,y_val)
# append the accuracies, the time and the number of iterations in the corresponding list
score_train.append(value_train)
score_val.append(value_val)
timer.append(end-begin)
n_iter.append(model.n_iter_)
# calculate the average and the std for each measure (accuracy, time and number of iterations)
avg_time = round(np.mean(timer),3)
avg_train = round(np.mean(score_train),3)
avg_val = round(np.mean(score_val),3)
std_time = round(np.std(timer),2)
std_train = round(np.std(score_train),2)
std_val = round(np.std(score_val),2)
avg_iter = round(np.mean(n_iter),1)
std_iter = round(np.std(n_iter),1)
return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_val) + '+/-' + str(std_val), str(avg_iter) + '+/-' + str(std_iter)
def show_results(df, *args):
"""
Receive an empty dataframe and the different models and call the function avg_score
"""
count = 0
# for each model passed as argument
for arg in args:
# obtain the results provided by avg_score
time, avg_train, avg_val, avg_iter = avg_score(arg)
# store the results in the right row
df.iloc[count] = time, avg_train, avg_val, avg_iter
count+=1
return df
# For the models that don't have n_iter attribute
def avg_score_1(model):
# apply kfold
kf = KFold(n_splits=10)
# create lists to store the results from the different models
score_train = []
score_val = []
timer = []
n_iter = []
for train_index, val_index in kf.split(all_selected_variables):
# get the indexes of the observations assigned for each partition
X_train, X_val = all_selected_variables.iloc[train_index], all_selected_variables.iloc[val_index]
y_train, y_val = target.iloc[train_index], target.iloc[val_index]
# start counting time
begin = time.perf_counter()
# fit the model to the data
model.fit(X_train, y_train)
# finish counting time
end = time.perf_counter()
# check the mean accuracy for the train
value_train = model.score(X_train, y_train)
# check the mean accuracy for the validation
value_val = model.score(X_val,y_val)
# append the accuracies, the time and the number of iterations in the corresponding list
score_train.append(value_train)
score_val.append(value_val)
timer.append(end-begin)
#n_iter.append(model.n_iter_)
# calculate the average and the std for each measure (accuracy, time and number of iterations)
avg_time = round(np.mean(timer),3)
avg_train = round(np.mean(score_train),3)
avg_val = round(np.mean(score_val),3)
std_time = round(np.std(timer),2)
std_train = round(np.std(score_train),2)
std_val = round(np.std(score_val),2)
#avg_iter = round(np.mean(n_iter),1)
#std_iter = round(np.std(n_iter),1)
return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_val) + '+/-' + str(std_val)
#, str(avg_iter) + '+/-' + str(std_iter)
def show_results_1(df, *args):
"""
Receive an empty dataframe and the different models and call the function avg_score
"""
count = 0
# for each model passed as argument
for arg in args:
# obtain the results provided by avg_score
time, avg_train, avg_val = avg_score_1(arg)
# store the results in the right row
df.iloc[count] = time, avg_train, avg_val
count+=1
return df
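`avg_score` and `avg_score_1` differ only in whether they read `model.n_iter_`. A possible way to merge them, sketched here on a tiny synthetic problem (not the notebook's data), is to read the attribute defensively with `getattr`:

```python
# Sketch (not the notebook's code): one helper that reports the iteration
# count when the estimator exposes n_iter_, and None otherwise.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def iterations_of(model):
    """Return the fitted model's iteration count, or None for estimators
    (such as GaussianNB) that have no n_iter_ attribute."""
    n = getattr(model, "n_iter_", None)
    return None if n is None else int(np.ravel(n)[0])

# Toy data just to fit something
X = np.random.RandomState(42).rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)
it_lr = iterations_of(LogisticRegression().fit(X, y))  # an integer
it_nb = iterations_of(GaussianNB().fit(X, y))          # None
```

A single `avg_score` built around this helper would remove the duplicated KFold loop.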
def plot_tree(model_tree):
dot_data = export_graphviz(model_tree,
feature_names=X_train.columns,
class_names=["Income lower or equal to avg", "Income higher than avg"],
filled=True)
pydot_graph = pydotplus.graph_from_dot_data(dot_data)
pydot_graph.set_size('"20,20"')
return graphviz.Source(pydot_graph.to_string())
#AUC
def calculate_AUC(interval, x_train, x_val, y_train, y_val, parameter, max_depth = None):
train_results = []
val_results = []
for value in interval:
if (parameter == 'max_depth'):
dt = DecisionTreeClassifier(max_depth = value, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'max_features'):
dt = DecisionTreeClassifier(max_features = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_samples_split'):
dt = DecisionTreeClassifier(min_samples_split = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_samples_leaf'):
dt = DecisionTreeClassifier(min_samples_leaf = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_weight_fraction_leaf'):
dt = DecisionTreeClassifier(min_weight_fraction_leaf = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
elif (parameter == 'min_impurity_decrease'):
dt = DecisionTreeClassifier(min_impurity_decrease = value, max_depth = max_depth, random_state=42)
dt.fit(x_train, y_train)
train_pred = dt.predict(x_train)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Add auc score to previous train results
train_results.append(roc_auc)
y_pred = dt.predict(x_val)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Add auc score to previous validation results
val_results.append(roc_auc)
value_train = train_results.index(max(train_results))
value_val = val_results.index(max(val_results))
print('The best train value is ',interval[value_train])
print('The best validation value is ',interval[value_val])
line1, = plt.plot(interval, train_results, 'b', label="Train AUC")
line2, = plt.plot(interval, val_results, 'r', label="Validation AUC")
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel("AUC score")
plt.xlabel(str(parameter))
plt.show()
Note: parameters in decision trees don't really improve performance; they are meant to control overfitting
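To illustrate the note above on data unrelated to the project (a synthetic `make_classification` problem, not the Newland dataset): capping `max_depth` does not so much raise accuracy as it shrinks the train/validation gap.

```python
# Sketch on synthetic data: an unconstrained tree memorises the training
# split, while max_depth narrows the train/validation gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2, random_state=42)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=42)

full = DecisionTreeClassifier(random_state=42).fit(Xtr, ytr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=42).fit(Xtr, ytr)

gap_full = full.score(Xtr, ytr) - full.score(Xva, yva)        # large gap
gap_shallow = shallow.score(Xtr, ytr) - shallow.score(Xva, yva)  # small gap
```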
dt_entropy = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train)
dt_gini = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Gini','Entropy'])
show_results_1(df,dt_gini, dt_entropy)
dt_random = DecisionTreeClassifier(splitter = 'random').fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['best','random'])
show_results_1(df,dt_gini, dt_random)
max_depths = np.linspace(1, 15, 15, endpoint=True)
calculate_AUC(max_depths, X_train, X_val, y_train, y_val, 'max_depth')
dt_depth10 = DecisionTreeClassifier(max_depth = 10).fit(X_train, y_train)
dt_depth3 = DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['full','depth10','depth3'])
show_results_1(df,dt_gini, dt_depth10,dt_depth3)
# The higher the depth, the more overfitting! 6 gives the best result of the 3 (least overfitting and highest validation score)
max_features = list(range(1,len(X_train.columns)))
calculate_AUC(max_features, X_train, X_val, y_train, y_val,'max_features', 10)
# I don't think this is necessary! The plot shows it would overfit, and we already have an acceptable number of variables
# Besides, it picks the variables to use at random
min_samples_split = list(range(10,600))
calculate_AUC(min_samples_split, X_train, X_val, y_train, y_val,'min_samples_split', 10)
dt_min50 = DecisionTreeClassifier(min_samples_split = 50).fit(X_train, y_train)
dt_min125 = DecisionTreeClassifier(min_samples_split = 125).fit(X_train, y_train)
dt_min323 = DecisionTreeClassifier(min_samples_split = 323).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['dt_min50','dt_min125','dt_min323'])
show_results_1(df, dt_min50, dt_min125, dt_min323)
# Here, the lower the value, the more overfitting! 350 already gives a balanced result (the best and most generalised)
min_samples_leaf = list(range(10,600))
calculate_AUC(min_samples_leaf, X_train, X_val, y_train, y_val,'min_samples_leaf', 10)
dt_min_leaf11 = DecisionTreeClassifier(min_samples_leaf = 11).fit(X_train, y_train)
dt_min_leaf400 = DecisionTreeClassifier(min_samples_leaf = 400).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min leaf 11','Min leaf 400'])
show_results_1(df, dt_gini, dt_min_leaf11, dt_min_leaf400)
# Harder to draw a conclusion from, and it has the same effect as min_samples_split
# The higher -> more underfitting; the lower (default) -> fully grown tree (overfitting)
# More useful for imbalanced datasets!
min_weight_fraction_leaf = np.linspace(0, 0.3, 250, endpoint=True)
calculate_AUC(min_weight_fraction_leaf, X_train, X_val, y_train, y_val,'min_weight_fraction_leaf', 10)
dt_min_weight_1 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.00361).fit(X_train, y_train)
dt_min_weight_2 = DecisionTreeClassifier(min_weight_fraction_leaf = 0.05).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Min leaf 1','Min weight small','Min weight med'])
show_results_1(df, dt_gini, dt_min_weight_1, dt_min_weight_2)
# Using a value other than 0.0 already made a difference! 0.05 reduced the score! Use 0.02
min_impurity_decrease = np.linspace(0, 0.05, 500, endpoint=True)
calculate_AUC(min_impurity_decrease, X_train, X_val, y_train, y_val,'min_impurity_decrease', 10)
dt_impurity01 = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_train, y_train)
dt_impurity0001 = DecisionTreeClassifier(min_impurity_decrease=0.0001).fit(X_train, y_train)
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Baseline','dt_impurity01','dt_impurity0001'])
show_results_1(df,dt_gini, dt_impurity01,dt_impurity0001)
# The best is min_impurity_decrease=0.0001!
#ccp_alpha
dt_alpha = DecisionTreeClassifier(random_state=42)
path = dt_alpha.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots(figsize = (10,10))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha", fontsize=15)
ax.set_ylabel("total impurity of leaves", fontsize=15)
ax.set_title("Total Impurity vs effective alpha for training set", fontsize=15)
#the function below did not accept ccp_alphas lower than 0
ccp_alphas=ccp_alphas[ccp_alphas>0]
trees = []
for ccp_alpha in ccp_alphas:
dt_alpha = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha).fit(X_train, y_train)
trees.append(dt_alpha)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(trees[-1].tree_.node_count, ccp_alphas[-1]))
trees = trees[:-1]
ccp_alphas = ccp_alphas[:-1]
train_scores = [tree.score(X_train, y_train) for tree in trees]
val_scores = [tree.score(X_val, y_val) for tree in trees]
fig, ax = plt.subplots(figsize = (10,10))
ax.set_xlabel("alpha", fontsize=15)
ax.set_ylabel("accuracy", fontsize=15)
ax.set_title("Accuracy vs alpha for training and validation sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, val_scores, marker='o', label="validation", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(val_scores)
best_model = trees[index_best_model]
print('ccp_alpha of best model: ', best_model.ccp_alpha)
print('_____________________________________________________________')
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Validation accuracy of best model: ',best_model.score(X_val, y_val))
dt_t1=DecisionTreeClassifier(min_impurity_decrease=0.0001, max_depth = 10,min_samples_split = 323,min_weight_fraction_leaf = 0.00361,random_state=42).fit(X_train, y_train)
dt_t2=DecisionTreeClassifier(max_depth = 10,min_weight_fraction_leaf = 0.00361,random_state=42).fit(X_train, y_train)
dt_t3=DecisionTreeClassifier(min_samples_split = 323,min_weight_fraction_leaf = 0.00361,random_state=42).fit(X_train, y_train)
dt_t4=DecisionTreeClassifier(max_depth = 10, min_samples_split = 323,min_weight_fraction_leaf = 0.00361,random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t1.score(X_train, y_train))
print('Validation accuracy:',dt_t1.score(X_val, y_val))
print('Train accuracy:',dt_t2.score(X_train, y_train))
print('Validation accuracy:',dt_t2.score(X_val, y_val))
print('Train accuracy:',dt_t3.score(X_train, y_train))
print('Validation accuracy:',dt_t3.score(X_val, y_val))
print('Train accuracy:',dt_t4.score(X_train, y_train))
print('Validation accuracy:',dt_t4.score(X_val, y_val))
# Also building the tree identified as best by the ccp_alpha analysis:
dt_t5=DecisionTreeClassifier(ccp_alpha=0.000154, random_state=42).fit(X_train, y_train)
print('Train accuracy:',dt_t5.score(X_train, y_train))
print('Validation accuracy:',dt_t5.score(X_val, y_val))
#does changing the threshold improve the accuracy?
threshold = 0.55
predicted_proba = dt_t5.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
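Rather than trying a single cut-off such as 0.55 by hand, the threshold can be scanned over a grid. A minimal sketch, using made-up probabilities rather than `dt_t5`'s output:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def best_threshold(proba_pos, y_true, grid=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return the (threshold, accuracy) pair with the highest accuracy."""
    scores = [(t, accuracy_score(y_true, (proba_pos >= t).astype(int))) for t in grid]
    return max(scores, key=lambda pair: pair[1])

y_true = np.array([0, 0, 1, 1, 1])             # toy labels, not y_val
proba = np.array([0.2, 0.45, 0.55, 0.7, 0.9])  # toy probabilities
t, acc = best_threshold(proba, y_true)
```

On the real model this would be called as `best_threshold(dt_t5.predict_proba(X_val)[:, 1], y_val)`.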
# To build the ROC curve
prob_model1 = dt_t1.predict_proba(X_val)
prob_model2 = dt_t2.predict_proba(X_val)
prob_model3 = dt_t3.predict_proba(X_val)
prob_model4 = dt_t4.predict_proba(X_val)
prob_model5 = dt_t5.predict_proba(X_val)
fpr_1, tpr_1, thresholds_1 = roc_curve(y_val, prob_model1[:, 1])
fpr_2, tpr_2, thresholds_2 = roc_curve(y_val, prob_model2[:, 1])
fpr_3, tpr_3, thresholds_3 = roc_curve(y_val, prob_model3[:, 1])
fpr_4, tpr_4, thresholds_4 = roc_curve(y_val, prob_model4[:, 1])
fpr_5, tpr_5, thresholds_5 = roc_curve(y_val, prob_model5[:, 1])
plt.plot(fpr_1, tpr_1, label="ROC Curve model 1")
plt.plot(fpr_2, tpr_2, label="ROC Curve model 2")
plt.plot(fpr_3, tpr_3, label="ROC Curve model 3")
plt.plot(fpr_4, tpr_4, label="ROC Curve model 4")
plt.plot(fpr_5, tpr_5, label="ROC Curve model 5")
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
# the curves came out very similar, hard to tell which is the best
The best is decision tree 5, with ccp_alpha as the only changed parameter
labels_train = dt_t5.predict(X_train)
labels_val = dt_t5.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# This is just to check the complexity of the tree
print('The "best" tree has a depth of ' + str(dt_t5.get_depth()) + ', ' + str(dt_t5.tree_.node_count) +
' nodes and a total of ' + str(dt_t5.get_n_leaves()) + ' leaves.')
ensemble_clfs = [
("RandomForestClassifier, max_features='auto'",
RandomForestClassifier(oob_score=True,
max_features='auto',
random_state=42)),
("RandomForestClassifier, max_features='log2'",
RandomForestClassifier(max_features='log2',
oob_score=True,
random_state=42)),
("RandomForestClassifier, max_features=7",
RandomForestClassifier(max_features=7,
oob_score=True,
random_state=42)),
("RandomForestClassifier, max_features=None",
RandomForestClassifier(max_features=None,
oob_score=True,
random_state=42))
]
from collections import OrderedDict
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 175 #225
for label, clf in ensemble_clfs:
for i in range(min_estimators, max_estimators + 1):
clf.set_params(n_estimators=i)
clf.fit(X_train, y_train)
# Record the OOB error for each `n_estimators=i` setting.
oob_error = 1 - clf.oob_score_
error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
xs, ys = zip(*clf_err)
plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
# Creating and fitting the models
rf_1 = RandomForestClassifier(n_estimators=140, max_depth=10, random_state = 42).fit(X_train, y_train)
rf_2 = RandomForestClassifier(n_estimators=140, max_depth=10, max_features = 'log2', random_state = 42).fit(X_train, y_train)
rf_3 = RandomForestClassifier(n_estimators=140, max_depth=10, min_samples_split=323, random_state = 42).fit(X_train, y_train)
rf_4= RandomForestClassifier(min_samples_split = 323, min_weight_fraction_leaf = 0.00361,random_state=42).fit(X_train, y_train)
rf_5= RandomForestClassifier(ccp_alpha=0.000154, random_state=42).fit(X_train, y_train)
rf_6= RandomForestClassifier(max_depth = 3, min_weight_fraction_leaf = 0.00361, random_state=42).fit(X_train, y_train)
rf_7= RandomForestClassifier(n_estimators=140, max_depth=3, random_state = 42).fit(X_train, y_train)
rf_8 = RandomForestClassifier(n_estimators=140, max_depth=3, max_features = 'log2', random_state = 42).fit(X_train, y_train)
print('Train accuracy:',rf_1.score(X_train, y_train))
print('Validation accuracy:',rf_1.score(X_val, y_val))
print('Train accuracy:',rf_2.score(X_train, y_train))
print('Validation accuracy:',rf_2.score(X_val, y_val))
print('Train accuracy:',rf_3.score(X_train, y_train))
print('Validation accuracy:',rf_3.score(X_val, y_val))
print('Train accuracy:',rf_4.score(X_train, y_train))
print('Validation accuracy:',rf_4.score(X_val, y_val))
print('Train accuracy:',rf_5.score(X_train, y_train))
print('Validation accuracy:',rf_5.score(X_val, y_val))
print('Train accuracy:',rf_6.score(X_train, y_train))
print('Validation accuracy:',rf_6.score(X_val, y_val))
print('Train accuracy:',rf_7.score(X_train, y_train))
print('Validation accuracy:',rf_7.score(X_val, y_val))
print('Train accuracy:',rf_8.score(X_train, y_train))
print('Validation accuracy:',rf_8.score(X_val, y_val))
models = ['rf_1', 'rf_2', 'rf_3','rf_4','rf_5', 'rf_6', 'rf_7', 'rf_8']
accuracies = [rf_1.score(X_val, y_val), rf_2.score(X_val, y_val), rf_3.score(X_val, y_val), rf_4.score(X_val, y_val),
rf_5.score(X_val, y_val), rf_6.score(X_val, y_val), rf_7.score(X_val, y_val), rf_8.score(X_val, y_val)]
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
plt.bar(data[0], data[1], color='peru')
plt.ylim(0.84, 0.87)
plt.show()
labels_train = rf_2.predict(X_train)
labels_val = rf_2.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
#predict values for X_test, e.g. for the citizen in X_test[0] we are predicting y[0] -> 0
#changing the threshold does not seem to improve the accuracy of the best random forest!
threshold = 0.5
predicted_proba = rf_2.predict_proba(X_val)
predicted = (predicted_proba [:,1] >= threshold).astype('int')
accuracy = accuracy_score(y_val, predicted)
accuracy
#importing and defining the model
log_model = LogisticRegression(random_state=42)
log_model.fit(X_train,y_train) #fit model to our train data
labels_train = log_model.predict(X_train)
log_model.score(X_train, y_train)
#Predict class labels for samples in X
labels_val = log_model.predict(X_val)
log_model.score(X_val, y_val)
#predict values for X_test, e.g. for the citizen in X_test[0] we are predicting y[0] -> 0
pred_prob = log_model.predict_proba(X_val)
pred_prob
#the cutoff is usually 0.5, but sometimes it is preferable to consider a lower value
X_train.columns
log_model.coef_
#since we don't have the residuals, we cannot use OLS; it does not apply to logistic regression
#with these coefficient values we can only say that a positive coefficient pushes the curve up and a negative one pushes it down
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, log_model)
metrics(y_train, labels_train, y_val, labels_val)
#precision: ability of the classifier not to label a negative sample as positive
#recall: ability of the classifier to find all the positive samples
#accuracy: out of the whole dataset, the fraction of observations we get right
#f1: weighted harmonic mean of precision and recall
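The F1 definition mentioned above (harmonic mean of precision and recall) can be checked numerically on a toy prediction vector:

```python
# Tiny numeric check: f1 = 2 * precision * recall / (precision + recall)
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p = precision_score(y_true, y_pred)   # 2 of the 3 predicted positives are right
r = recall_score(y_true, y_pred)      # 2 of the 3 actual positives are found
f1 = f1_score(y_true, y_pred)
```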
#modelNB = GaussianNB(var_smoothing=0.001) # train score: 0.8112 validation score: 0.8153
#modelNB = GaussianNB(var_smoothing=0.0001) #train score: 0.8126 validation score: 0.8175
modelNB = GaussianNB() # train score: 0.81996 validation score: 0.82425
modelNB.fit(X = X_train, y = y_train)
labels_train = modelNB.predict(X_train)
labels_val = modelNB.predict(X_val)
modelNB.predict_proba(X_val)
print("train score:", modelNB.score(X_train, y_train))
print("validation score:",modelNB.score(X_val, y_val))
# To inspect the class imbalance, and the mean and variance for each class
print(modelNB.class_prior_) #prob 0, prob 1
print(modelNB.class_count_)#n 0, n 1
# modelNB.theta_
# modelNB.sigma_
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelNB)
metrics(y_train, labels_train, y_val, labels_val)
model = MLPClassifier(random_state=42)
model.fit(X_train, y_train)
labels_train = model.predict(X_train)
labels_val = model.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
losses = model.loss_curve_
iterations = range(model.n_iter_)
sns.lineplot(x=list(iterations), y=losses)
model.loss_
#Get the weight matrix by calling the attribute coefs_:
model.coefs_
#Get the bias vector by calling the attribute intercepts_:
model.intercepts_
# We are using all_selected_variables scaled with min-max to [0, 1]. We could:
#test min-max scaling to [-1, 1]
#test all_selected_variables with a robust scaler
model = MLPClassifier(random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model)
model_1 = MLPClassifier(hidden_layer_sizes=(1),random_state=42)
model_2 = MLPClassifier(hidden_layer_sizes=(3),random_state=42)
model_3 = MLPClassifier(hidden_layer_sizes=(9),random_state=42)
model_4 = MLPClassifier(hidden_layer_sizes=(3, 3),random_state=42)
model_5 = MLPClassifier(hidden_layer_sizes=(5, 5),random_state=42)
model_6 = MLPClassifier(hidden_layer_sizes=(3, 3, 3),random_state=42) #3 layers each one with 3 units
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_1','M_2','M_3', 'M_4','M_5','M_6'])
show_results(df, model_1, model_2, model_3, model_4, model_5, model_6)
model_7 = MLPClassifier(hidden_layer_sizes=(4, 4),random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_7'])
show_results(df, model_7)
Models 1, 5, 6 and 7 overfit. The best is model 3. M_5 is also very good but shows a bit of overfitting -> test M_3 and M_5, and then M_6
model_logistic = MLPClassifier(activation = 'logistic',random_state=42)
model_tanh = MLPClassifier(activation = 'tanh',random_state=42)
model_relu=MLPClassifier(activation = 'relu',random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['logistic','tanh','relu'])
show_results(df, model_logistic, model_tanh,model_relu)
Logistic is better: same score in fewer iterations. Logistic also shows less overfitting than tanh, even though the difference is not significant.
Logistic provides a normalised output between 0 and 1; tanh provides a normalised output between -1 and 1.
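A quick numerical check of those output ranges, in plain NumPy and independent of the MLP models above:

```python
# Sanity check: the logistic (sigmoid) squashes into (0, 1),
# while tanh squashes into (-1, 1).
import numpy as np

z = np.linspace(-10, 10, 1001)
sigmoid = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)
```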
model_lbfgs = MLPClassifier(solver = 'lbfgs',random_state=42) #low dim and sparse data
model_sgd = MLPClassifier(solver = 'sgd',random_state=42) #accuracy > processing time
model_adam = MLPClassifier(solver = 'adam',random_state=42) # big dataset but might fail to converge
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['lbfgs','sgd','adam'])
show_results(df, model_lbfgs, model_sgd, model_adam)
Adam is the best, although it shows a little overfitting; therefore we will test adam and sgd (which overfits less)
model_constant = MLPClassifier(solver = 'lbfgs', learning_rate = 'constant',random_state=42)
model_invscaling = MLPClassifier(solver = 'lbfgs', learning_rate = 'invscaling',random_state=42)
model_adaptive = MLPClassifier(solver = 'lbfgs', learning_rate = 'adaptive',random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['constant','invscaling','adaptive'])
show_results(df, model_constant, model_invscaling, model_adaptive)
model_adaptive.score(X_val, y_val)
Constant is the best
model_a = MLPClassifier(solver = 'adam', learning_rate_init = 0.5,random_state=42) #the larger the value, the faster the model learns
model_b = MLPClassifier(solver = 'adam', learning_rate_init = 0.1,random_state=42)
model_c = MLPClassifier(solver = 'adam', learning_rate_init = 0.01,random_state=42) #if it is too small, the model can get stuck in a suboptimal solution and may never converge
model_d = MLPClassifier(solver = 'adam', learning_rate_init = 0.001,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['M_a','M_b','M_c', "M_d"])
show_results(df, model_a, model_b, model_c, model_d)
The best is 0.01
model_e = MLPClassifier(solver = 'adam', learning_rate_init = 0.005,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ["M_e"])
show_results(df, model_e)
model_batch20 = MLPClassifier(solver = 'sgd', batch_size = 20,random_state=42)
model_batch50 = MLPClassifier(solver = 'sgd', batch_size = 50,random_state=42)
model_batch100 = MLPClassifier(solver = 'sgd', batch_size = 100,random_state=42)
model_batch200 = MLPClassifier(solver = 'sgd', batch_size = 200,random_state=42)
model_batch500 = MLPClassifier(solver = 'sgd', batch_size = 500,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['batch 20','batch 50','batch 100', 'batch 200', 'batch 500'])
show_results(df, model_batch20, model_batch50, model_batch100, model_batch200, model_batch500)
The best one is batch 50
model_maxiter_50 = MLPClassifier(max_iter = 50,random_state=42)
model_maxiter_100 = MLPClassifier(max_iter = 100,random_state=42)
model_maxiter_200 = MLPClassifier(max_iter = 200,random_state=42)
model_maxiter_300 = MLPClassifier(max_iter = 300,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 50','max iter 100','max iter 200', 'max iter 300'])
show_results(df, model_maxiter_50, model_maxiter_100, model_maxiter_200, model_maxiter_300)
model_maxiter_150 = MLPClassifier(max_iter = 150,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 150'])
show_results(df, model_maxiter_150)
model_all=MLPClassifier(hidden_layer_sizes=(4,4),activation = 'tanh',solver = 'lbfgs',batch_size = 50,random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model_all)
# parameter_space = {
# 'hidden_layer_sizes': [(5,5), (4,4)],
# 'activation': ['tanh'],
# 'solver': ['lbfgs'],
# 'batch_size': [(200),(500)],
# 'max_iter': [(50),(100)],
# }
# clf = GridSearchCV(model, parameter_space,n_jobs=-1)
# clf.fit(X_train, y_train)
# clf.best_params_
model_grid = MLPClassifier(activation='tanh', batch_size=200, hidden_layer_sizes=(4, 4), max_iter=100, solver='lbfgs', random_state=42)
df = pd.DataFrame(columns = ['Time','Train','Validation','Iterations'], index = ['Raw'])
show_results(df, model_grid)
# Best parameter set (requires the GridSearchCV above to have been run)
# print('Best parameters found:\n', clf.best_params_)
# All results
# means = clf.cv_results_['mean_test_score']
# stds = clf.cv_results_['std_test_score']
# for mean, std, params in zip(means, stds, clf.cv_results_['params']):
#     print("%0.3f (+/-%0.03f) for %r" % (mean, std, params))
Based on the second choice of parameters, after the random_state was set
# parameter_space1 = {
# 'hidden_layer_sizes': [(5,5),(4,4)],
# 'activation': ['tanh'],
# 'solver': ['lbfgs'],
# 'batch_size': [(200),(500)],
# 'max_iter': [(50),(100)],
# }
# clf1 = GridSearchCV(model, parameter_space1,n_jobs=-1)
# clf1.fit(X_train, y_train)
# clf1.best_params_
modelNN_best = MLPClassifier(activation='tanh', batch_size=200, hidden_layer_sizes=(4, 4), max_iter=100, solver='lbfgs', random_state=42)
df= pd.DataFrame(columns = ['Time','Train','Val', 'Iterations'], index = ['Raw'])
show_results(df, modelNN_best)
# Best parameter set (requires the GridSearchCV above to have been run)
# print('Best parameters found:\n', clf1.best_params_)
# All results
# means = clf1.cv_results_['mean_test_score']
# stds = clf1.cv_results_['std_test_score']
# for mean, std, params in zip(means, stds, clf1.cv_results_['params']):
#     print("%0.3f (+/-%0.03f) for %r" % (mean, std, params))
# Model with best accuracy
labels_train = modelNN_best.predict(X_train)
labels_val = modelNN_best.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
The number K is typically chosen as the square root of the total number of points in the training data set. Thus, in this case, N is 15680, so K = 125.
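The rule of thumb above, computed rather than stated:

```python
# sqrt-of-N heuristic for the initial K in KNN
import math

n_train = 15680                # training-set size quoted in the text
k = round(math.sqrt(n_train))  # ≈ 125
```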
# try K=50 through K=150 and record testing accuracy
k_range = range(50, 150)
scores = []
# We use a loop through the range
# We append the scores in the list
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_val)
scores.append(accuracy_score(y_val, y_pred))
# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Validation Accuracy')
modelKNN1 = KNeighborsClassifier().fit(X = X_train, y = y_train)
print("train score:", modelKNN1.score(X_train, y_train))
print("validation score:",modelKNN1.score(X_val, y_val))
modelKNN2 = KNeighborsClassifier(n_neighbors=80).fit(X = X_train, y = y_train)
print("train score:", modelKNN2.score(X_train, y_train))
print("validation score:",modelKNN2.score(X_val, y_val))
#from the available algorithms (excluding the default), this was the best one
modelKNN3 = KNeighborsClassifier(n_neighbors=80, algorithm='ball_tree').fit(X = X_train, y = y_train)
print("train score:", modelKNN3.score(X_train, y_train))
print("validation score:",modelKNN3.score(X_val, y_val))
modelKNN4 = KNeighborsClassifier(n_neighbors=80, p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN4.score(X_train, y_train))
print("validation score:",modelKNN4.score(X_val, y_val))
modelKNN5 = KNeighborsClassifier(n_neighbors=80, weights='distance').fit(X = X_train, y = y_train)
print("train score:", modelKNN5.score(X_train, y_train))
print("validation score:",modelKNN5.score(X_val, y_val))
modelKNN6 = KNeighborsClassifier(n_neighbors=80, algorithm='ball_tree', p=1).fit(X = X_train, y = y_train)
print("train score:", modelKNN6.score(X_train, y_train))
print("validation score:",modelKNN6.score(X_val, y_val))
df = pd.DataFrame(columns = ['Time','Train','Validation'], index = ['modelKNN1', 'modelKNN2', 'modelKNN3', 'modelKNN4', 'modelKNN5', 'modelKNN6'])
show_results_1(df, modelKNN1, modelKNN2, modelKNN3, modelKNN4, modelKNN5, modelKNN6)
# Model with best accuracy
labels_train = modelKNN2.predict(X_train)
labels_val = modelKNN2.predict(X_val)
metrics(y_train, labels_train, y_val, labels_val)
# Creating and fitting model
pac_basic = PassiveAggressiveClassifier(random_state=42)
pac_basic.fit(X_train, y_train)
pac_1 = PassiveAggressiveClassifier(C=0.001, fit_intercept=True, tol=1e-2, loss='squared_hinge',random_state=42)
pac_1.fit(X_train, y_train)
pac_2 = PassiveAggressiveClassifier(C=0.001, tol=1e-2, loss='squared_hinge',random_state=42)
pac_2.fit(X_train, y_train)
pac_3 = PassiveAggressiveClassifier(C=0.001, tol=1e-2, random_state=42)
pac_3.fit(X_train, y_train)
# Making prediction on the validation set
val_pred_basic = pac_basic.predict(X_val)
val_pred_1 = pac_1.predict(X_val)
val_pred_2 = pac_2.predict(X_val)
val_pred_3 = pac_3.predict(X_val)
df = pd.DataFrame(columns = ['Time','Train','Validation','Iterations'], index = ['PAC_Basic','PAC_1','PAC_2','PAC_3'])
show_results(df, pac_basic, pac_1, pac_2, pac_3)
labels_train = pac_3.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = pac_3.predict(X_val)
accuracy_score(y_val, labels_val)
metrics(y_train, labels_train, y_val, labels_val)
modelLDA = LinearDiscriminantAnalysis()
modelLDA.fit(X = X_train, y = y_train)
labels_train = modelLDA.predict(X_train)
labels_val = modelLDA.predict(X_val)
modelLDA.predict_proba(X_val)
print("train score:", modelLDA.score(X_train, y_train))
print("validation score:",modelLDA.score(X_val, y_val))
# from sklearn.model_selection import GridSearchCV
# # define grid
# grid = dict()
# grid['solver'] = ['svd', 'lsqr', 'eigen']
# # define search
# search = GridSearchCV(modelLDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
# from numpy import arange
# grid = dict()
# grid['shrinkage'] = arange(0, 1, 0.01)
# grid['solver']=['svd', 'lsqr', 'eigen'] #svd cannot be tested with shrinkage
# # define search
# search = GridSearchCV(modelLDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelLDA_final = LinearDiscriminantAnalysis(solver='lsqr')
modelLDA_final.fit(X = X_train, y = y_train)
labels_train = modelLDA_final.predict(X_train)
labels_val = modelLDA_final.predict(X_val)
print("train score:", modelLDA_final.score(X_train, y_train))
print("validation score:",modelLDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
modelQDA = QuadraticDiscriminantAnalysis()
modelQDA.fit(X = X_train, y = y_train)
labels_train = modelQDA.predict(X_train)
labels_val = modelQDA.predict(X_val)
modelQDA.predict_proba(X_val)
print("train score:", modelQDA.score(X_train, y_train))
print("validation score:",modelQDA.score(X_val, y_val))
# # define grid
# grid = dict()
# grid['reg_param'] = arange(0, 1, 0.01)
# # define search
# search = GridSearchCV(modelQDA, grid, scoring='accuracy', n_jobs=-1)
# # perform the search
# results = search.fit(X_train, y_train)
# # summarize
# print('Mean Accuracy: %.3f' % results.best_score_)
# print('Config: %s' % results.best_params_)
modelQDA_final = QuadraticDiscriminantAnalysis(reg_param=0.4)
modelQDA_final.fit(X = X_train, y = y_train)
labels_train = modelQDA_final.predict(X_train)
labels_val = modelQDA_final.predict(X_val)
print("train score:", modelQDA_final.score(X_train, y_train))
print("validation score:",modelQDA_final.score(X_val, y_val))
metrics(y_train, labels_train, y_val, labels_val)
# # try C=250 through C=1250 and record validation accuracy
# C_range = range(250, 1250)
# scores = []
# # loop through the range and append each accuracy score to the list
# for c in C_range:
#     svm = SVC(C=c)
#     svm.fit(X_train, y_train)
#     y_pred = svm.predict(X_val)
#     scores.append(accuracy_score(y_val, y_pred))
# # plot the relationship between C and validation accuracy
# plt.plot(C_range, scores)
# plt.xlabel('Value of C for the SVM')
# plt.ylabel('Validation Accuracy')
modelSVM_basic = SVC().fit(X_train, y_train)
modelSVM_1 = SVC(kernel='linear').fit(X_train, y_train)
modelSVM_2 = SVC(C=1000).fit(X_train, y_train)
modelSVM_3 = SVC(kernel = 'poly').fit(X_train, y_train)
modelSVM_4 = SVC(C=1000, kernel = 'poly').fit(X_train, y_train)
modelSVM_5 = SVC(C=1000, kernel = 'linear').fit(X_train, y_train)
modelSVM_6 = SVC(C=1000, shrinking=False).fit(X_train, y_train)
modelSVM_7 = SVC(C=1000, tol=1e-2).fit(X_train, y_train)
accuracies = [modelSVM_basic.score(X_val, y_val), modelSVM_1.score(X_val, y_val),
modelSVM_2.score(X_val, y_val), modelSVM_3.score(X_val, y_val),
modelSVM_4.score(X_val, y_val), modelSVM_5.score(X_val, y_val),
modelSVM_6.score(X_val, y_val), modelSVM_7.score(X_val, y_val)]
models = ['modelSVM_basic', 'modelSVM_1', 'modelSVM_2', 'modelSVM_3',
'modelSVM_4', 'modelSVM_5', 'modelSVM_6', 'modelSVM_7']
data_tuples = list(zip(models,accuracies))
data = pd.DataFrame(data_tuples)
data = data.sort_values(1)
plt.bar(data[0], data[1], color='peru')
plt.xticks(rotation=90)
plt.ylim(0.80,0.86)
plt.show()
# highest accuracy from the SVMs
modelSVM_1.score(X_val, y_val)
pred_train_svm = modelSVM_1.predict(X_train)
pred_val_svm = modelSVM_1.predict(X_val)
metrics(y_train, pred_train_svm, y_val, pred_val_svm)
def calculate_f1(interval, x_train, x_val, y_train, y_val, parameter):
    train_results = []
    val_results = []
    # Fit one AdaBoost model per candidate value of the chosen parameter
    for value in interval:
        if parameter == 'Number of estimators':
            dt = AdaBoostClassifier(n_estimators=value, random_state=5)
        elif parameter == 'Learning Rate':
            dt = AdaBoostClassifier(learning_rate=value, random_state=5)
        dt.fit(x_train, y_train)
        train_results.append(f1_score(y_train, dt.predict(x_train)))
        val_results.append(f1_score(y_val, dt.predict(x_val)))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is ', interval[value_train])
    print('The best val value is ', interval[value_val])
    fig = plt.figure(figsize=(16, 10))
    line1, = plt.plot(interval, train_results, label="Train F1", linewidth=3, color='peru')
    line2, = plt.plot(interval, val_results, label="Val F1", linewidth=3, color='b')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("F1 score")
    plt.xlabel(str(parameter))
    plt.show()
num_estimators = list(range(1,100))
calculate_f1(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
learning_rate = list(np.arange(0.01, 2, 0.05))
calculate_f1(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
# AdaBoost = AdaBoostClassifier()
# AdaBoost_parameters = {'base_estimator' : [None, modelNB, modelQDA_final, pac_1, modelLDA_final],
# 'n_estimators' : list(range(1,100)),
# 'learning_rate' : np.arange(0.5, 1.5, 0.05),
# 'algorithm' : ['SAMME', 'SAMME.R']}
# AdaBoost_grid = GridSearchCV(estimator=AdaBoost, param_grid=AdaBoost_parameters,
# scoring='accuracy', verbose=1, n_jobs=-1)
# AdaBoost_grid.fit(X_train , y_train)
# AdaBoost_grid.best_params_
modelAdaBoost = AdaBoostClassifier(base_estimator=None, n_estimators=94, learning_rate=1.25, algorithm='SAMME.R', random_state=42)
modelAdaBoost.fit(X_train,y_train)
labels_train = modelAdaBoost.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelAdaBoost.predict(X_val)
accuracy_score(y_val, labels_val)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Raw'])
show_results_1(df, modelAdaBoost)
metrics(y_train, labels_train, y_val, labels_val)
# Code to generate the predictions on the test set!
# robust = RobustScaler()
# robust_scaled= robust.fit_transform(test.values)
# test= pd.DataFrame(robust_scaled, columns=test.columns, index=test.index)
# Citizen=df_test['CITIZEN_ID']
# labels_test= modelAdaBoost.predict(test)
# prediction=pd.concat([Citizen, pd.DataFrame(labels_test)],axis=1)
# prediction['Income']=prediction[0]
# prediction.drop(columns=0,inplace=True)
# prediction.to_csv(r'C:\Users\matip\Documents\Mestrado\Machine Learning\Project\Proj\Predictions\Pred8.csv',index=False, header=True,sep=',')
def calculate_f1_2(interval, x_train, x_val, y_train, y_val, parameter):
    train_results = []
    val_results = []
    # Fit one Gradient Boosting model per candidate value of the chosen parameter
    for value in interval:
        if parameter == 'Number of estimators':
            dt = GradientBoostingClassifier(n_estimators=value, random_state=5)
        elif parameter == 'Learning Rate':
            dt = GradientBoostingClassifier(learning_rate=value, random_state=5)
        dt.fit(x_train, y_train)
        train_results.append(f1_score(y_train, dt.predict(x_train)))
        val_results.append(f1_score(y_val, dt.predict(x_val)))
    value_train = train_results.index(max(train_results))
    value_val = val_results.index(max(val_results))
    print('The best train value is ', interval[value_train])
    print('The best val value is ', interval[value_val])
    fig = plt.figure(figsize=(16, 10))
    line1, = plt.plot(interval, train_results, label="Train F1", linewidth=3, color='peru')
    line2, = plt.plot(interval, val_results, label="Val F1", linewidth=3, color='b')
    plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
    plt.ylabel("F1 score")
    plt.xlabel(str(parameter))
    plt.show()
learning_rate = list(np.arange(0.05, 1.5, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
learning_rate = list(np.arange(0.05, 0.6, 0.05))
calculate_f1_2(learning_rate, X_train, X_val, y_train, y_val,'Learning Rate')
num_estimators = list(np.arange(1, 200, 10))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(np.arange(150, 300, 10))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(np.arange(100, 500, 50))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
num_estimators = list(np.arange(300, 550, 20))
calculate_f1_2(num_estimators, X_train, X_val, y_train, y_val,'Number of estimators')
modelGBauto = GradientBoostingClassifier(max_features='auto', random_state=42)
modelGBlog = GradientBoostingClassifier(max_features='log2',random_state=42)
modelGBsqrt = GradientBoostingClassifier(max_features='sqrt',random_state=42)
modelGBnone = GradientBoostingClassifier(max_features=None,random_state=42)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['Auto','Log2','Sqrt','None/Raw'])
show_results_1(df, modelGBauto, modelGBlog, modelGBsqrt, modelGBnone)
modelGBdev = GradientBoostingClassifier(loss='deviance', random_state=42)
modelGBexp = GradientBoostingClassifier(loss='exponential',random_state=42)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['deviance','exponential'])
show_results_1(df, modelGBdev, modelGBexp)
modelGB2 = GradientBoostingClassifier(max_depth=2, random_state=5)
modelGB3 = GradientBoostingClassifier(max_depth=3,random_state=5)
modelGB10 = GradientBoostingClassifier(max_depth=10,random_state=5)
modelGB30 = GradientBoostingClassifier(max_depth=30,random_state=5)
modelGB50 = GradientBoostingClassifier(max_depth=50,random_state=5)
df= pd.DataFrame(columns = ['Time','Train','Validation'], index = ['model2','model3','model10','model30','model50'])
show_results_1(df, modelGB2, modelGB3,modelGB10,modelGB30,modelGB50)
# GB_clf = GradientBoostingClassifier()
# GB_parameters = {'loss' : [ 'deviance','exponential'],
# 'learning_rate' : np.arange(0.3, 0.6, 0.05),
# 'n_estimators' : np.arange(400, 500, 10),
# 'max_depth' : np.arange(2, 10, 1),
# 'max_features' : ['auto', None]
# }
# GB_grid = GridSearchCV(estimator=GB_clf, param_grid=GB_parameters, scoring='accuracy', verbose=1, n_jobs=-1)
# GB_grid.fit(X_train , y_train)
# GB_grid.best_params_
modelGB = GradientBoostingClassifier(learning_rate=0.35, loss='deviance', max_depth=2, max_features='auto',
n_estimators=460, random_state=5)
modelGB.fit(X_train, y_train)
labels_train = modelGB.predict(X_train)
accuracy_score(y_train, labels_train)
labels_val = modelGB.predict(X_val)
accuracy_score(y_val, labels_val)
metrics(y_train, labels_train, y_val, labels_val)
# Code to generate the predictions on the test set!
# robust = RobustScaler()
# robust_scaled= robust.fit_transform(test.values)
# test= pd.DataFrame(robust_scaled, columns=test.columns, index=test.index)
# Citizen=df_test['CITIZEN_ID']
# labels_test= modelGB.predict(test)
# prediction=pd.concat([Citizen, pd.DataFrame(labels_test)],axis=1)
# prediction['Income']=prediction[0]
# prediction.drop(columns=0,inplace=True)
# prediction.to_csv(r'C:\Users\matip\Documents\Mestrado\Machine Learning\Project\Proj\Predictions\Pred9.csv',index=False, header=True,sep=',')